Preface


In 2022, the annual joint workshop of the Fraunhofer Institute of Optronics, 
System Technologies and Image Exploitation (IOSB) and the Vision and Fusion 
Laboratory (IES) of the Institute for Anthropomatics, Karlsruhe Institute of 
Technology (KIT) was hosted again in a Black Forest house near Triberg. 


For a week from the 31st of July to the 5th of August, the PhD students of 
both institutions delivered extended reports on the status of their research and 
participated in heated discussions on topics ranging from computer vision and 
optical metrology to usage control, control theory and neural networks. Most 
results and ideas presented at the workshop are collected in this book in the 
form of detailed technical reports. This volume provides a comprehensive and 
up-to-date overview of some of the research programs of the IES Laboratory 
and the Fraunhofer IOSB. 


The editors thank Arno Appenzeller, Jonas Vogl, Paul Wagner and Zeyun Zhong 
for their efforts resulting in a pleasant and inspiring atmosphere throughout 
the week. We would also like to thank the doctoral students for writing and 
reviewing the technical reports as well as for responding to the comments and 
suggestions of their colleagues.


Abstract


Machine learning systems are often hard to investigate and opaque in their
decision making. Explainable Artificial Intelligence (XAI) tries to make these
systems more transparent. However, most work in the field focuses on technical 
aspects like maximizing metrics. The human aspects of explainability are often 
neglected. In this work, we present personalized explanations, which instead 
focus on the user. Personalized explanations can be adapted to individual users 
to be as useful and relevant as possible. They can be interacted with to give 
users the ability to engage in an explanatory dialog with the system. Finally, 
they should also protect user data to increase the trust in the explanation system. 


1 Introduction 


Artificial intelligence and machine learning have become extremely popular 
technologies that are widely used because of their many advantages. However, 
learned models like neural networks also have some major disadvantages,
especially their lack of transparency. During training, the models learn correlations
from the training data that enable them to make predictions on unseen data
and take decisions. What exactly a model has learned, what it pays attention
to and how it makes decisions is, however, hard to comprehend. The field
of explainable artificial intelligence (XAI) tackles this problem and tries to
make learning systems more transparent. The goal is to create explanations for
systems that help users or operators to better understand the models and their
inner workings [2].

However, much of the research in the field is very technical, neglects the
human aspect of explainability and relies only on researchers' intuitions. Many
works concentrate on technical aspects and try to maximize metrics that are not
validated by user studies or grounded in psychology. Papers that do focus on
users mostly address user interfaces and not the underlying algorithms [12].


In this work we present ideas for XAI methods that are better adapted to users
to make them more useful and relevant. The first aspect is that methods should
be individualized to the user to make them more helpful. Users should have the
ability to customize an explanation in order to adapt it to their needs. Existing
methods of interaction and individualization are presented in Section 2. A new
approach for individualized and interactive explanations is discussed in
Section 3. Section 3.1 explains the approach and focuses on the individualization.
However, users have to be able to interact with the system to get explanations
that they understand and that are relevant to them. Methods of interaction with
the new approach are shown in Section 3.2. Individualized explanations do,
however, require personal data of the user. In order
to build trust in the system, the user data has to be protected. Explanations
can also help users to understand what kind of data is needed for a system to
function properly. Concepts for data protection and data minimization are
shown in Section 4.


2 Background 


There are different existing approaches in the literature to interact with XAI
systems. The first one is to give the user the option to generate multiple
explanations [15]. By generating multiple explanations the user gets different
viewpoints and has a better chance of understanding them. This can be
done by generating multiple explanations of the same type or explanations of
different types. The user can also get the option to change the input data [5]. By
changing the instances that are explained by the system the user can get a better
overview of the feature space and the behavior of the system. Other works
evaluate the interaction with graphical representations or user interfaces [6]. They
investigate how different visualizations or interfaces help users to understand
the explanations. Another way to interact with an explanation system is through
an interface using natural language processing [4]. This way the user can use
natural language to interact with the system, which makes it much more suited
to end users with little technical knowledge. All these approaches leave the
explanation system itself untouched and only build different user experiences
around it. By interacting with an explanation system, the explanations are
also individualized on a basic level. However, explanations can also be
individualized explicitly. DiCE [14] can generate counterfactual explanations
that are diverse, meaning that different explanation instances differ from
each other. The method also supports feature constraints, which are used to
ensure feasibility of the explanations but can also be used to adapt explanations
to individual users.


3 Individualized Explanations 


Individualized explanations should be adaptable to the use case as well as the 
individual user. The explanations can be adapted by an admin or professional 
user or the end user of the system itself. Different aspects can be considered when 
adapting explanations. One aspect is general knowledge or world knowledge 
as well as knowledge about the use case in which a system is deployed. In a 
medical use case for example, other aspects are relevant compared to a financial 
use case. Different features are important in different contexts and different 
applications so the explanations have to reflect that. Explanations should also 
be adapted to the user group. Different user groups have different abilities and 
knowledge levels in machine learning and the application domain. Machine 
learning experts, domain experts and end users have different capabilities and 
require different explanations. However, the explanations should not only be
adapted to the user group but also to the individual user. To achieve this,
different sources of background knowledge can be used (see Section 3.1) or
the user can be given the ability to customize the explanations herself (see
Section 3.2).


3.1 Personalized Counterfactuals 


Counterfactual explanations are a form of explanation for machine learning 
systems. A counterfactual describes an alternative state in which some changes 
were made that lead to a different outcome. In machine learning, the factual is an 
instance whose prediction from a model should be explained. The counterfactual 
is an instance with small changes that lead to a different classification. For 
example, if a credit application is rejected, a counterfactual explanation could state
that the application would have been accepted if the credit amount had been lowered
by a certain amount [13].


Counterfactuals are a local explanation method, which means that they explain a
single data point or decision of a model, in contrast to global explanations, which
explain the behavior of the whole model. They are calculated by searching
for the instance closest to the one that should be explained that changes the
prediction of the classifier. This can be done by random sampling [3], using a
gradient [9], formulating the search as an optimization problem [16] or with
genetic algorithms [14]. Counterfactuals originate from counterfactual thinking,
which people engage in regularly [12]. Thus, people are already used to the
concept, which makes these explanations especially user friendly and suited to
non-technical end users.


It is, however, not obvious how different features with different value ranges
and units should be treated when comparing counterfactual explanations. It
is, for example, not possible to objectively determine which change in the credit
amount is equivalent to which change in the credit duration. The idea behind
personalized counterfactuals is to use a weighted distance metric to calculate the
distance between a factual and different counterfactuals. The weights can be chosen by
the user to represent her preferences. If, for example, the weight for the credit
duration is low and the weight for the credit amount is high, a change in the
credit amount is penalized more and a change in the credit duration is
preferred. This means that by changing the weights of the distance function,
users can adapt the counterfactual explanations to their needs or preferences.
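
A minimal sketch of such a weighted distance is given below. It assumes numeric features and divides each difference by the value range of the feature before weighting; this scaling step, the function name and the example values are illustrative assumptions and not part of the described system.

```python
import math


def weighted_distance(factual, counterfactual, weights, ranges):
    """Weighted Euclidean distance between a factual and a counterfactual.

    Each feature difference is divided by the value range of the feature
    (an assumption made here so that different units become comparable)
    and the squared result is scaled by the user-chosen weight.
    """
    total = 0.0
    for name, weight in weights.items():
        diff = (counterfactual[name] - factual[name]) / ranges[name]
        total += weight * diff ** 2
    return math.sqrt(total)


# Hypothetical credit example: a high weight on the amount, a low one on the duration.
factual = {"amount": 10000.0, "duration": 24.0}
ranges = {"amount": 20000.0, "duration": 60.0}
weights = {"amount": 5.0, "duration": 1.0}

lower_amount = {"amount": 8000.0, "duration": 24.0}      # 10 % of the amount range
longer_duration = {"amount": 10000.0, "duration": 30.0}  # 10 % of the duration range

d_amount = weighted_distance(factual, lower_amount, weights, ranges)
d_duration = weighted_distance(factual, longer_duration, weights, ranges)
```

With these weights the counterfactual that changes the duration is closer to the factual than the one that changes the amount, so a search that minimizes this distance would prefer it.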


Besides the weighting of features, users have further ways to personalize
the explanations. Some features are unchangeable, like a person's place of birth.
Users can tell the system to ignore such features in the explanations. Other
options are to restrict the value range in which a feature can be changed or to
make it changeable in only one direction, like an age, which can only increase.
If it is a multi-class classification problem, the user can also specify the class the
counterfactuals should be classified as.
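
The per-feature options can be thought of as a feasibility check on each candidate counterfactual. The following sketch illustrates this under the assumption that instances are dictionaries; the function and parameter names are hypothetical.

```python
def is_feasible(factual, candidate, immutable=(), bounds=None, increase_only=()):
    """Check a candidate counterfactual against user-defined feature constraints."""
    bounds = bounds or {}
    for name, value in candidate.items():
        original = factual[name]
        # Ignored (unchangeable) features must keep their original value.
        if name in immutable and value != original:
            return False
        # Restricted value range for a feature.
        if name in bounds:
            low, high = bounds[name]
            if not low <= value <= high:
                return False
        # Features that may only change in one direction, e.g. age.
        if name in increase_only and value < original:
            return False
    return True


# Hypothetical example instance and constraints.
factual = {"age": 40, "birthplace": "Karlsruhe", "amount": 10000}
constraints = dict(
    immutable={"birthplace"},
    bounds={"amount": (1000, 20000)},
    increase_only={"age"},
)

ok = is_feasible(
    factual, {"age": 41, "birthplace": "Karlsruhe", "amount": 8000}, **constraints
)
younger = is_feasible(
    factual, {"age": 35, "birthplace": "Karlsruhe", "amount": 8000}, **constraints
)
moved = is_feasible(
    factual, {"age": 40, "birthplace": "Triberg", "amount": 8000}, **constraints
)
```

Only the first candidate passes; the second lowers the age and the third changes an unchangeable feature.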


In addition to options for single features, users also have the option to set global
metrics to diversify the different counterfactual explanations shown to them.
These global metrics evaluate a set of different counterfactuals. By changing
them, the user can get multiple similar explanations or more diverse ones.
Users can also adjust a weight on the number of columns in which changes
were made. With these global metrics, the user can configure the overall set of
different counterfactual explanations.
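
How such set-level metrics could be computed is sketched below. Using the mean pairwise distance as a diversity measure and the mean number of changed features as a sparsity measure are assumptions made for illustration, as are all names and weights.

```python
from itertools import combinations


def sparsity(factual, counterfactual):
    """Number of features whose value differs from the factual."""
    return sum(1 for name in factual if counterfactual[name] != factual[name])


def diversity(counterfactuals, distance):
    """Mean pairwise distance within a set of counterfactuals."""
    pairs = list(combinations(counterfactuals, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def set_score(factual, counterfactuals, distance, diversity_weight, sparsity_weight):
    """Higher is better: reward diversity, penalize the number of changed features."""
    mean_changed = sum(sparsity(factual, c) for c in counterfactuals) / len(counterfactuals)
    return diversity_weight * diversity(counterfactuals, distance) - sparsity_weight * mean_changed


def l1(a, b):
    """Simple unweighted L1 distance, used here only for the example."""
    return sum(abs(a[name] - b[name]) for name in a)


factual = {"amount": 10.0, "duration": 2.0}
similar = [{"amount": 8.0, "duration": 2.0}, {"amount": 8.5, "duration": 2.0}]
different = [{"amount": 8.0, "duration": 2.0}, {"amount": 10.0, "duration": 2.5}]

score_similar = set_score(factual, similar, l1, diversity_weight=1.0, sparsity_weight=0.5)
score_different = set_score(factual, different, l1, diversity_weight=1.0, sparsity_weight=0.5)
```

Raising `diversity_weight` favors sets like `different`, while raising `sparsity_weight` favors sets in which each counterfactual changes few features.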


All these settings can be adjusted by an administrator to represent world 
knowledge or adapt the explanations to a specific use case. The administrator 
will set these options once for a specific application. In a second step, the end 
user can adjust all settings or a smaller subset in the application itself. This is 
done to personalize the explanations to the specific user. 


After all the weights and metrics are set, the search for personalized
counterfactuals is done with an evolutionary algorithm. Features of the original instance are
randomly changed, excluding the ignored features. The new instances generated
in this way are passed to the model to check whether they are counterfactuals or,
if a target class was specified, whether they are classified as the wanted class.
Then, the distance from the original instance
is measured by a weighted Euclidean distance using the previously specified
weights. The set of instances is also evaluated by the global metrics. The best
instances are chosen and changed again. This process is iterated until the rating
by the metrics no longer changes. This approach is completely model
agnostic because it uses no internal information, such as gradients, about the
model that is explained. Only predictions of the model are used to check whether
instances are counterfactuals or whether they are classified as the wanted class.
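
The search loop can be sketched as follows. This is a simplified illustration and not the implementation described above: it runs for a fixed number of generations instead of iterating until the rating stabilizes, it omits the global metrics, and the toy model, step sizes and all names are assumptions.

```python
import math
import random


def evolve_counterfactuals(predict, factual, target, weights, steps,
                           ignored=(), population=30, keep=5,
                           generations=40, seed=0):
    """Model-agnostic evolutionary search for counterfactuals of `factual`."""
    rng = random.Random(seed)
    names = [n for n in factual if n not in ignored]

    def distance(instance):
        # Weighted Euclidean distance to the factual over changeable features.
        return math.sqrt(sum(weights[n] * (instance[n] - factual[n]) ** 2
                             for n in names))

    def mutate(instance):
        # Randomly change one feature that is not ignored.
        child = dict(instance)
        name = rng.choice(names)
        child[name] += rng.uniform(-steps[name], steps[name])
        return child

    parents, best = [dict(factual)], []
    for _ in range(generations):
        children = [mutate(rng.choice(parents)) for _ in range(population)]
        # Only predictions are used, so no model internals are required.
        valid = [c for c in children + best if predict(c) == target]
        best = sorted(valid, key=distance)[:keep]
        parents = best if best else children
    return best


def toy_model(x):
    """Hypothetical credit model: accept if the amount is small for the duration."""
    return "accepted" if x["amount"] <= 4.0 * x["duration"] else "rejected"


factual = {"amount": 10.0, "duration": 2.0, "age": 30.0}
found = evolve_counterfactuals(
    toy_model, factual, "accepted",
    weights={"amount": 5.0, "duration": 1.0},
    steps={"amount": 1.0, "duration": 0.5},
    ignored=("age",),
)
```

Every instance in `found` is classified as the target class, leaves the ignored feature untouched, and the list is ordered by weighted distance to the factual.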


3.2 Interactive Explanations


An important aspect to adapt explanations to users is to enable users to interact 
with the system. This way, users can influence explanations and customize 
them to their needs, be it their knowledge level in a certain domain or their 
technical expertise. Explanations between humans are also often given in a 
form of dialog. The explainee can ask about things she does not understand 
or for details on the given explanation. The explainer can then give additional 
information, reformulate the given explanation or come up with a new one. 
Interaction and individualization go hand in hand because users are only able 
to individualize their explanations if they are able to interact with the system 
and users that can interact with an explanation system will try to get a better 
understanding by adapting the explanations to their needs. The personalized 
counterfactual explanations can also be interacted with in several ways. The 
first way is to change the weights for different features. This influences how 
much a feature is changed in the generated counterfactuals. Users can also 
exclude features from the search by marking them as unchangeable. This helps
to generate only satisfiable explanations and not ones that are impossible or
unrealistic, for example by demanding a change to a person's race. The next way to
interact with the system is to adjust the global metrics that compare the set of 
generated counterfactuals. With these metrics, the users can get more or less 
diverse explanations and influence how sparse the explanations are, meaning 
how many features are changed. The final way is by changing the target class. 
With this setting, users can tell the system to generate explanations for a specific 
target class. By looking for counterfactuals with a specific target class, the user 
can see what changes to an instance are needed to reach the desired class. 


4 Protection of Personal Data 


The previous sections showed how explanations can be personalized and
interacted with to better meet the user's needs. However, personalization also
has a drawback: it requires personal data from the users. Without some form
of data about the user of an explanation system it is not possible to adapt
explanations to the user. Personal data is subject to the European Union's General
Data Protection Regulation (GDPR) [8] as well as the Artificial Intelligence
Act [7]. These regulations demand the protection of personal data. One way to
protect user data in any application is the use of trusted computing methods.
An approach to this is shown in Section 4.1. Explanations can also be used to
show users what influence sharing or not sharing some data has on a system.
Users can see how system behavior changes with their decision and find a
configuration that works for them. With such explanations users can make
informed decisions on what data they want to share with a system and what data
they want to keep private. This way users are able to minimize the data they
have to share with a system. The idea of using XAI to explain the effect of
sharing data is shown in Section 4.2.


4.1 Explainable AI and Trusted Computing 


Users may be uncomfortable with sharing their data with a system that they do
not understand and cannot trust. XAI can explain a system's behavior to a user,
but users often have to share their data first in order
to get explanations that are relevant to their situation. To keep personal
data safe and to increase the trust in the system, trusted computing
methods can be used. Two trusted computing technologies that are useful for
this are Trusted Platform Modules (TPMs) and Trusted Execution Environments
(TEEs). TPMs are trusted hardware modules that can verify the state of a system.
TEEs are a separate part of the processor that enables secure data processing and
is not accessible even by the operating system. These methods can be used to
secure an explanation system and make it more trustworthy, either by verifying
that the system is in a trustworthy state with TPMs or by executing code in
TEEs. The combination of these technologies can create trust in the system
through trusted computing methods and trust in the underlying machine learning
model through XAI.


4.2 Explainable AI for Data Minimization 


Some applications require different kinds of data from users. For example, a
health app may be interested in health data, location data and general information
about the user. However, users may not want to share all this data with an
application. In order to let users make an informed decision about the data they
want to share, they have to know how the system behavior changes if they decide
to keep some of the data private. Here, we present a concept of how XAI
can be used to generate such explanations.


The idea is to use a combination of SHAP [11] and counterfactual explanations.
SHAP is an XAI method building on Shapley values. It calculates a feature
importance by omitting features from an instance and replacing them with values
from random instances from the training set. The predictions of the model with
the random feature values are averaged and compared to the result of the original
instance. This way, the influence of the feature value on the original instance
can be calculated. This idea can be combined with the counterfactual explanations
described in Section 3.1. The combination of the two methods should be able to
explain to users how not sharing some of their data would influence the system
behavior.
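
The replace-and-average step can be illustrated with the following sketch. It is a deliberate simplification and not the SHAP algorithm itself, which additionally averages over coalitions of features; the toy health model, the data and all names are hypothetical.

```python
import random


def replacement_importance(predict, instance, feature, training_set,
                           samples=200, seed=0):
    """Estimate the influence of one feature value on a prediction.

    The feature is replaced by values taken from random training instances,
    the resulting predictions are averaged and the average is compared with
    the prediction for the original instance.
    """
    rng = random.Random(seed)
    original = predict(instance)
    total = 0.0
    for _ in range(samples):
        perturbed = dict(instance)
        perturbed[feature] = rng.choice(training_set)[feature]
        total += predict(perturbed)
    return original - total / samples


def toy_risk_model(x):
    """Hypothetical health score that only depends on the heart rate."""
    return 0.5 * x["heart_rate"]


training_set = [
    {"heart_rate": 60, "location": 1},
    {"heart_rate": 60, "location": 5},
]
user = {"heart_rate": 80, "location": 3}

effect_heart_rate = replacement_importance(toy_risk_model, user, "heart_rate", training_set)
effect_location = replacement_importance(toy_risk_model, user, "location", training_set)
```

In this toy setting, withholding the location would not change the result at all, whereas withholding the heart rate would shift the score by ten points. An explanation of this kind tells the user which data can be kept private without affecting the system behavior.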


5 Summary 


In this work, we presented some ideas for making explanations for AI systems
more relevant to users. First, methods for individualizing explanations were
shown that make it possible to adapt explanations to individual users. Afterwards,
existing principles of interaction with explanation systems were presented and it
was shown how users can interact with personalized explanations. Interaction
and individualization are interrelated because users have to interact with a system
in order to individualize an explanation. Methods for protecting user data in the
explanation process through trusted computing were shown. Finally, an idea
of how to minimize the data a user has to share by providing explanations was
presented.


Abstract


In this report we present a pipeline for static coverage planning of known
objects, which is an important task in the field of mobile-robot-based inspection.
We analyse the main components of the Structural Inspection Planner [1] and
embed an improved implementation into an autonomous flight pipeline for UAVs.
Triangle mesh models serve as input for an initial viewpoint sampling. Inspection
quality and path length are optimized by formulating the viewpoint sampling
as a constrained QP. We thoroughly evaluate the ROS-based inspection pipeline
on synthetic and real models using a Gazebo simulation. Our experimental
evaluation shows that while an efficient inspection trajectory could be generated
for most of the tested models, the result depends strongly on regular and
well-formed input models.


1 Introduction 


Automated structural inspection tasks have become increasingly important in
the last couple of years. As facilities grow larger, it is difficult to guarantee a
continuous and smooth operation without generating a high manual workload.
The field of automating those operations is called Non-Destructive Inspection
(NDI) [9]. Utilizing mobile robots to perform NDIs can keep workers out of
dangerous situations and save a lot of time and resources. Especially with
the availability of UAVs (Unmanned Aerial Vehicles), high-risk operations
such as the inspection of buildings, bridges or offshore structures become feasible.
Nooralishahi et al. recently provided an in-depth review of the current state of
UAVs in NDI [9]. Oftentimes, it is desirable to plan an inspection on the basis
of an existing model of the respective structure. This allows quantifying the
structural damage on the surface compared to the existing model. However, such
an approach requires the mobile inspection robot to approach specific viewpoints
with high accuracy in order to inspect the correct target area. This is very
difficult to achieve for a manual UAV operator and leads to artifacts in the imagery
generated during the inspection flight due to the accumulation of positioning
errors. These issues can be dealt with by automatically generating suitable
viewpoints and a corresponding inspection trajectory. This way, consistent
inspection imagery can be generated in an automated fashion over a number
of repetitions.


Figure 1.1: Outline of a UAV inspection flight. The drone with a fixed camera follows the
inspection trajectory (blue), while tracking its exact position in the world coordinate frame W. Blue
points on the trajectory mark viewpoints for specific triangles. The position of the viewpoint g_s
depends on a set of constraints, for instance the one constructed from the parameters d_min and d_max
given in green.


Therefore, we propose a framework which allows for inspection planning of
structures using different kinds of mobile robots, even though we focus on UAVs
here. A schematic visualization of the inspection process is given in Figure 1.1.
More detailed building blocks of the proposed framework are depicted in
Figure 3.1. The inspection planning approach is conceptually based on the
Structural Inspection Planner (SIP) [1], which has been developed to calculate
inspection trajectories for fixed-wing drones as well as UAVs for existing triangle
meshes. Our work identifies weaknesses of the SIP and re-implements it with
certain modernisations and adaptations. We also abstract the planner component
in a way that it can be used inside a larger simulation framework. This allows
us to easily perform tests and simulation flights for reconstruction purposes
using the framework. Section 2 gives a more detailed overview of the structural
inspection planner and other related work, before we present the main structure
of our inspection pipeline in Section 3. We apply the planner to different
artificial and real models in a Gazebo simulation. This allows us to easily test
different UAV, scenario and sensor setups and also account for measurement
uncertainties. The qualitative and quantitative evaluation of these experiments
is presented in Section 4. Finally, we conclude our work, identify drawbacks
and comprehensively describe possible future adaptations in the concluding section.


The proposed framework is built within the Robot Operating System (ROS,
https://ros.org/), which connects the different components as visualized in
Figure 3.1. ROS is widespread in the robotics community as it comes with sensor
drivers, state-of-the-art simulation frameworks (Gazebo [7],
https://gazebosim.org/) and many prebuilt algorithms for perception
and navigation. This allows us to abstract the inspection planner into a single
node as a modular component in a larger UAV stack developed in a previous
work [5]. This stack is supported by a Gazebo simulation of a UAV platform
running the px4 software stack (https://px4.io/). Px4 [8] is an open source
autopilot running on various drones. It comes with Software-in-the-Loop (SiL) and
Hardware-in-the-Loop (HiL) features, which allow our scenarios to be simulated
in a realistic way.


2 Related Work


UAVs are a natural choice for structural inspection as they allow for agile 
movements in complicated and cluttered environments. This led to extensive 
research for flight planners in the last couple of years. Oftentimes, the desired 
goal is to create a reconstruction of a previously unknown environment. In this 
work, we focus on model-based inspection, where we require an accurate mesh 
of the target. Previous works with these conditions are rare. 


The work by Yan et al. uses a multistage approach to generate a coarse
reconstruction in a first step and then samples viewpoints for a high-quality
reconstruction in a second step [15]. Such an approach targets large-scale
reconstruction, as prior knowledge of the target shape is not utilized. Instead, the
skeleton is built with a costly Structure-from-Motion technique. Schmid et al. provide an
online informative path planner, where only one inspection flight is required. It
uses an RRT*-inspired exploration scheme with object coverage as optimization
target. They showed a TSDF-based reconstruction of previously unknown target
areas [12/4].

The Structural Inspection Planner (SIP) is one of the few frameworks which
explicitly uses triangle meshes as input to sample a viewpoint trajectory. It
samples viewpoints for each triangle in the mesh. Viewpoints have geometrically
derived constraints, which are solved as a global optimization problem. In a next
step, all sampled viewpoints are connected in an efficient way by interpreting the
trajectory generation as a Traveling Salesman Problem (TSP). The steps of viewpoint
sampling and trajectory generation via TSP are combined in an iterative fashion
until a minimal-length path is found. In practical application, however, we found
that the planner samples inadmissible viewpoints or does not converge at all for
difficult meshes. The initial viewpoint sampling is highly dependent on the
structure of the triangle mesh. Jing et al. [6] also use an explicit model as
input representation. However, they require a voxelized version of the model
to first sample a suitable inspection area (via-points) using voxel dilation of
the target. Suitable path primitives are then randomly sampled and verified
by estimating the target visibility at each point. In a final step, a graph-based
method is used to generate the final UAV trajectory from the path primitives.
For larger objects, the operation on single voxels can become costly, so that the
visibility calculation is not feasible for all admissible via-points.


A recent work by Debus and Rodehorst on the inspection of buildings
provides evaluation metrics for path planning approaches. The corresponding
Bauhaus Path Planning Challenge comes with a framework implementing these
metrics for a set of models, which will also be targeted in this report. Besides
typical evaluation metrics such as path length and runtime, they also focus on
measurable reconstruction quality and surface resolution.


3 Planning Pipeline 


In the following, we briefly describe our pipeline architecture before we discuss
the inspection planner design in greater detail. The main building blocks of
the pipeline are depicted in Figure 3.1. We embed the planner into our UAV
framework presented in a previous work [5]. We design the main building
blocks “Viewpoint Sampling” and “Trajectory Generation” to be components
of a Planner Manager, as this manager also takes care of the sequential control
for replanning and avoiding obstacles during the mission. In the original
work [1], both blocks of viewpoint sampling and trajectory generation were
supposed to run multiple times in an alternating scheme. However, in the
experimental evaluation we show that the viewpoint arrangement resulting from
the initial sampling iteration is already an intuitive result providing full coverage.
Therefore, we mostly apply only one step of the sampling and trajectory planning
routines. This reduces the overall planning time at the cost of longer trajectories.


3.1 Viewpoint Sampling 


Figure 3.1: Structure of the whole simulation pipeline. The relevant blocks for inspection planning
are highlighted in blue. The system consists of individual components for localization, mapping and
control, which make it possible to fly the inspection trajectory within a simulated or real environment.
The pipeline works with systems based on the px4 [8] firmware, which implements the MAVLink
protocol for communication.


We use the same optimization scheme as Bircher et al. [1] and iterate
through all triangles in the mesh to generate one viewpoint each. A viewpoint
needs to fulfill all intrinsic and extrinsic constraints. The former refer to the
visibility of the targeted triangle, while the latter refer to boundary constraints
given by the user. In contrast to [1], we increase the flexibility of the optimization
problem by allowing an arbitrary number of constraints in the solver. This also
allows building more generic constraints for polygons instead of triangles. We
first calculate the normal a_N of each of the polygons with vertices {x_1, ..., x_n},
as well as all edge normals n_1, ..., n_n. In addition, we need to specify the
camera parameters horizontal and vertical field of view as well as the pitch.
Using these values as input, we build the following set of constraints.
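
Written out, a constraint set of this kind could take the following form. This is a sketch assembled from the verbal description in this section; the symbol g^k for the viewpoint position in iteration k, the use of x_1 as reference vertex for the distance constraint and the ordering of the rows are assumptions.

```latex
\begin{align}
\begin{pmatrix} x_{\min} \\ y_{\min} \\ z_{\min} \end{pmatrix}
  \le g^k &\le
\begin{pmatrix} x_{\max} \\ y_{\max} \\ z_{\max} \end{pmatrix}
  && \text{(sampling space)} \\
n_i^\top \left(g^k - x_i\right) &\ge 0, \quad i = 1, \dots, n
  && \text{(in front of every edge)} \\
d_{\min} \le a_N^\top \left(g^k - x_1\right) &\le d_{\max}
  && \text{(viewing distance)} \\
n_j^\top \left(g^k - x_j\right) &\ge 0, \quad
  j \in \{\text{upper}, \text{lower}, \text{left}, \text{right}\}
  && \text{(field of view)}
\end{align}
```

All rows are linear in g^k, which is what allows the viewpoint sampling to be posed as a quadratic program with linear constraints.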
Figure 3.2: Visualization of the camera parameters FOV and pitch as well as the notation for
triangle vertices and normals.


The left and right parts of the above equation, ubA = {x_min, y_min, z_min} and
lbA = {x_max, y_max, z_max}, quantify the admissible viewpoint sampling space.
The first n constraints force the viewpoint g^k to be sampled “in front” of the
triangle, as the projection of the hyperplane normal n_i of the respective edge
onto the vector spanned by the viewpoint, (g^k - x_i)^T, is required to be positive.
The next constraint forces the distance of the viewpoint to be in [d_min, d_max] by
restraining the projected distance along the triangle normal a_N. Finally, the last four
constraints are exactly the field of view constraints from [1]. They guarantee
that the viewpoint lies inside the horizontal and vertical FOV of a camera with a
specific pitch. To accomplish this, four hyperplane normals n_upper, n_lower, n_left,
n_right with respective anchor points x_<.> are sampled using the camera pitch
and FOV. A more detailed derivation of these constraints can be found in the
original work on SIP [1].


The optimization objective is to minimize the distance between two consecutive
viewpoints $g_p$ and $g_s$, which optimizes the total path length. In the original
formulation from [1], the viewpoint sampling was meant to run for multiple iterations,
such that the squared distance between $g_p^k$, $g^k$ and $g_s^k$ and their previous iteration
$k - 1$ is minimized. The quadratic problem is then solved using qpOASES,
resulting in an optimal position $g_{opt}$ for the current iteration k. In a next step, the
rotation is sampled by performing an explicit visibility analysis. In a simple
UAV scenario, pitch and roll of the rotorcraft are always fixed, while the yaw
angle $\psi$ is subject to change. For the sampled position and all possible yaw
angles in $[0, 2\pi]$ with a step size of 0.2 rad, we check if (a) all distance constraints
are met, (b) all vertices lie within the FOV of the camera, and (c) there are no
collisions on the way from the camera to the polygon. The collision check can be
implemented as a simple raycast, which allows it to consider other static
obstacles in the scene. The first position and orientation pair that passes all
requirements is selected as viewpoint.
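
The yaw sweep described above can be sketched as follows. This is a simplified 2D illustration under stated assumptions, not the implementation used in the framework: only the horizontal FOV is checked, and `distance_ok` and `ray_is_free` are hypothetical callbacks standing in for the distance constraints (a) and the raycast (c).

```python
import math

def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))

def select_yaw(position, vertices, h_fov, distance_ok, ray_is_free, step=0.2):
    """Return the first yaw in [0, 2*pi) that passes checks (a)-(c), else None.

    position, vertices: 2D points (x, y); h_fov: horizontal FOV in radians.
    distance_ok(position, vertices) and ray_is_free(position, vertex) are
    hypothetical callbacks (assumptions, not part of the original framework).
    """
    if not distance_ok(position, vertices):  # (a) does not depend on the yaw
        return None
    n_steps = int(2 * math.pi / step) + 1
    for s in range(n_steps):
        yaw = s * step
        # (b) every vertex must lie within half the FOV around the heading
        in_fov = all(
            abs(wrap_angle(math.atan2(v[1] - position[1],
                                      v[0] - position[0]) - yaw)) <= h_fov / 2
            for v in vertices
        )
        # (c) no obstacle between the camera and any vertex
        if in_fov and all(ray_is_free(position, v) for v in vertices):
            return yaw
    return None
```

Because the sweep returns the first admissible yaw, the selected orientation depends on the sweep direction and step size; a finer step trades runtime for a lower chance of rejecting a triangle.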


3.2 Trajectory Planning 


Given the set of N viewpoints $\{g_1, g_2, \ldots, g_N\}$, one for each triangle, we now
connect them into a single shortest-path trajectory $T_{opt}$. Connecting such a set of
“must visit” points is a typical application of the Traveling Salesman Problem
(TSP). Each viewpoint must be visited exactly once while minimizing the overall
path length. Even though the TSP is an NP-hard problem, efficient solvers exist for
the comparatively small number of viewpoints. We use a TSP solver developed
as part of the Google OR-Tools [10]. It accepts a custom distance matrix
between all viewpoints as input, which makes it possible to implicitly embed further
metrics, such as the change of yaw or inspection angle, into the optimization. However,
in the current version we simply use a Euclidean distance matrix in order to
optimize for path length as the primary objective. The TSP is then solved using
the guided local search heuristic, which is considered one of the most efficient
heuristics for routing problems.
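
To illustrate the role of the distance matrix, the following sketch builds a Euclidean matrix and improves a nearest-neighbour tour with 2-opt. It is a deliberately simple stand-in and not the OR-Tools guided local search used in this work; the closed tour and all function names are assumptions made for the example.

```python
import math
from itertools import combinations

def distance_matrix(points):
    """Euclidean distance between every pair of viewpoint positions."""
    return [[math.dist(p, q) for q in points] for p in points]

def tour_length(tour, dist):
    """Length of the closed tour (returns to the start; an assumption)."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))

def nearest_neighbour(dist, start=0):
    """Greedy initial tour: always move to the closest unvisited viewpoint."""
    unvisited = set(range(len(dist))) - {start}
    tour = [start]
    while unvisited:
        nxt = min(unvisited, key=lambda j: dist[tour[-1]][j])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def two_opt(tour, dist):
    """Reverse tour segments as long as this shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(1, len(tour)), 2):
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            if tour_length(candidate, dist) < tour_length(tour, dist) - 1e-12:
                tour, improved = candidate, True
    return tour
```

Replacing `distance_matrix` with a cost that also penalizes yaw changes would embed additional metrics in the same way the custom matrix does for the solver described above.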


4 Experimental Evaluation 


The utilized planning framework supports the PX4 Software-In-The-Loop
component for running experiments with Gazebo as simulator. We tested the planning
procedure on a number of different models. Figure 4.1 shows some of them, either
taken from the Bauhaus Challenge, coming with the SIP implementation,
or created from real-world objects on our premises. Even though Gazebo is not
capable of rendering the environment in a photorealistic way, it allows us to test
the results of the viewpoint sampling using different sensor setups and environment
data.


Parameter Name           Units    Default           Explanation
Planner
  d_min, d_max           meter    [6.0, 10.0]       distance constraints
  Min. incidence angle   degree   10°               ∠(a_N, n_<·>), cf. Fig.
Rotorcraft
  Max. velocity          m/s      2
  Max. angular velocity  rad/s    0.5
Space boundary
  Max. space size        meter    [200, 200, 50]    x, y and z size
  Space center           meter    [0, 0, 0]         3D coordinate [x, y, z]
Camera
  FOV                    degree   [120°, 120°]      [horizontal, vertical] field of view
  Pitch                  degree   30°               pitch angle of the camera


Table 4.1: List of the most important parameters and their respective default values within the
inspection framework.


(a) Bridge Pier (b) House (c) Hall (d) Wall Model


Figure 4.1: Exemplary triangle mesh models used for experimental evaluation. The first two were
taken from the Bauhaus Path Planning Challenge, while the last two have been created for simple
ablation studies.


The framework has various parameters, which heavily influence the experimental
results. We tried to use similar defaults as in [1]. Table 4.1 gives an overview
of the most important parameters and their default values.


We quantitatively evaluate the planners in different configurations on a set of
standard metrics. We use the planning time, path length and mean yaw
rate as main metrics. We also verify the number of rejected triangles. The
path length is specified as $L = \sum_{i=1}^{N-1} d_{i \to i+1}$, where N is the total number
of viewpoints and $d_{i \to i+1}$ is the distance between the i-th viewpoint and the
subsequent one. The mean yaw rate $\overline{\Delta\psi} = \frac{1}{N} \sum_{i=1}^{N-1} \lVert \Delta\psi_{i \to i+1} \rVert$ specifies the
mean change in yaw angle over time, with $\Delta\psi_{i \to i+1}$ being the change in yaw
between two consecutive viewpoints.


Model             Mesh Size   Path Length   Mean Yaw   Planning
                  [facets]    [m]           Rate [°]   Duration [s]
Hall              148         216           14.36      2.60
Artificial House  154         152           14.23      2.51
Bridge Pier       186         408           21.17      2.63
Bridge Pier       965         441           11.09      8.85
Bridge Pier       3930        2011          0.28       262.02


Table 4.2: Quantitative evaluation on the three main models hall, house and bridge pier. Mesh size
denotes the number of triangles and planning duration the total planning time for all steps.
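
Both metrics can be computed directly from the ordered list of viewpoints. The sketch below is only an illustration: normalizing the yaw metric by the number of viewpoints N and wrapping yaw differences to at most 180° are assumptions of this example, and the function names are hypothetical.

```python
import math

def path_length(positions):
    """Sum of Euclidean distances between consecutive viewpoint positions."""
    return sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))

def mean_yaw_change(yaws_deg):
    """Mean absolute yaw change between consecutive viewpoints in degrees.

    Differences are wrapped so that 350 deg -> 10 deg counts as 20 deg, and the
    sum is divided by the number of viewpoints N (both are assumptions).
    """
    def wrapped_diff(a, b):
        return abs((b - a + 180.0) % 360.0 - 180.0)

    total = sum(wrapped_diff(a, b) for a, b in zip(yaws_deg, yaws_deg[1:]))
    return total / len(yaws_deg)
```

For three viewpoints at (0, 0, 0), (3, 4, 0) and (3, 4, 12), the two segments have lengths 5 and 12, so `path_length` yields 17.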


Table 4.2 shows the results for some of the models in different resolutions. All
metrics depend on the number of triangles in the mesh. The main cause
of the increasing path length is outliers in the viewpoint sampling, while the
increased planning time results mostly from the exponential increase in TSP
solving time. The decrease in yaw rate simply follows from the fact that
many viewpoints are interpolations of viewpoints from the lower-resolution mesh
and thus do not contribute any turns of the UAV.


We qualitatively inspect the generated inspection trajectory on some of the
models in Figure 4.2. For the simplest wall model in Figure 4.2(a), the viewpoint
generation works as expected when running one iteration of viewpoint sampling.
The remaining images in Figure 4.2 show the bridge pier in different resolutions.
Even though the generated flight plan looks regular in general, more outlier
viewpoints are generated for the higher-resolution meshes. This is typically
the result of non-optimal QP outputs for some awkwardly placed
triangles. The solver is configured in such a way that some constraints may be relaxed
if a global optimum cannot be found. One approach to prevent such outliers from
being sampled is to restrict the $d_{max}$ parameter for the admissible sampling space.
In addition, instead of sampling one viewpoint per triangle, one could
combine multiple similar viewpoints in a later step to reduce the overall
path length. This would also make it possible to filter sampled outlier viewpoints, which
could lead to a smoother trajectory than in Figure 4.2. We observe a similar
behaviour for the models in Figure 4.3. We use red to indicate triangles for which
no viewpoints could be generated. As the camera pitch is fixed and the UAV
cannot fly below the ground, all ground triangles are marked red in Figure 4.3
and therefore do not participate in the trajectory planning. Figure 4.3(b) shows
the inspection path on the simple hall model and additionally marks the UAV
odometry from a simulated flight in red.


(a) Wall (b) Bridge Pier, 186 facets (c) Bridge Pier, 965 facets (d) Bridge Pier, 3930 facets


Figure 4.2: Generated paths for the wall model and for different resolutions of the bridge model.
The light green lines are the input mesh, blue the trajectory and the yellow arrows denote the
viewpoints and their direction.


Despite sampling some outlier viewpoints, we can observe a successful
coverage for all tested models. This can also be verified by performing a
reconstruction using the recorded images from the simulation. Figure 4.4 shows
the result of this procedure on the Hall model. We post-processed the images
with COLMAP to obtain the sparse and dense reconstruction results. Note
that the inspection planning itself does not target 3D reconstruction applications
in particular. We do not ensure within the viewpoint generation that a triangle
must be observed from two distinct positions. Nevertheless, the generated paths
often allow a dense reconstruction, as two consecutive viewpoints often
target neighboring triangles on the mesh, resulting in a stereo baseline for both
triangles.


(a) House (b) Hall


Figure 4.3: Generated paths for the artificial house and the hall. In addition to the trajectory we
mark rejected triangles in red.


(a) Textured Hall Model (b) Generated Inspection Plan (c) Reconstructed Model


Figure 4.4: Application of the inspection planning for reconstruction purposes. The first image (a)
shows a textured version of the model from Figure 4.1(c). We then use the generated inspection
plan to fly it in simulation and perform a reconstruction using the saved images (b). Finally, (c)
shows the dense reconstruction.


5 Conclusion and Future Work 
In this work, we presented and evaluated a structural inspection pipeline
for mobile robots using triangle meshes as input. We showed that intuitive
inspection trajectories could be generated for a set of different models. In order
to thoroughly test the inspection, we embedded it into a UAV autonomy pipeline
with simulation capabilities. Using that simulation, we also applied the routine
for the purpose of 3D reconstruction. We identify the availability and variability
of input data as the main drawback of the presented approach. Given an arbitrary
input mesh, transforming it into a regular triangle mesh with a desired number
of facets is a very error-prone pre-processing step. On the other hand,
the number of facets is the only parameter to control the initial number of
sampled viewpoints and thus the runtime required for the initial sampling step.
Some approaches, such as ACVD⁴, exist to simplify existing
meshes, but as soon as the geometries become complex and non-convex, we were not
able to produce a regular mesh (see Figure 5.1(b)).


(a) Excavator (b) Upper Carriage Mesh (c) Voxelized Object 


Figure 5.1: Example for an excavator model, which is difficult to tackle using a triangle mesh based 
inspection scheme. Using a dynamically generated voxelized version of the model such as in (c) 
might improve the viewpoint sampling and also allows resampling for different joint positions. 


One way to overcome these limitations in the future is the usage of a different
input modality. For instance, one could use a voxelized structure of the mesh
such as visualized in Figure 5.1(c), which is comparatively easy to generate even for
dynamic joint positions. This approach would require a strategy to divide the
voxel structure into different regions, as the workload for sampling one viewpoint
per voxel would be too high. This also raises the question of whether the formulation
as QP is even necessary if we only run one iteration of sampling and planning.
We can influence the resulting inspection trajectory either by running multiple
iterations or by adjusting the input modality in a way that fewer viewpoints are
sampled in the first place.


4 https://github.com/valette/ACVD


Further extensions to the current approach are conceivable. It could be useful to
explicitly encode inspection or reconstruction quality into the optimization. The
former would require some dynamically generated distance constraints in order to
achieve a user-definable ground sampling distance (GSD). The latter requires
that a triangle can be inspected from at least two viewpoints, so ideally each
viewpoint must encode the constraints for multiple triangles.


Abstract


When developing versatile machine learning systems, catastrophic forgetting
poses a significant challenge: models trained on tasks sequentially
suffer significant performance drops when put to use on earlier tasks. In spite
of the prevalence of catastrophic forgetting, its underlying cause and process
are still poorly understood. Commonly, the performance of a continual learning
algorithm is only measured using accuracy on the test set of the tasks within a
sequence. While test set accuracy is useful for comparing different continual
learning algorithms on their respective benchmarks, it cannot provide insights
into how and where the model is affected by catastrophic forgetting, as it
only provides final accuracy metrics. Therefore, we study how comparing
representations, re-training schemes and layer stitching can help to reveal effects
and causes of catastrophic forgetting.


1 Introduction 


A desired property of many machine learning models exposed to a changing
environment is the ability to progressively acquire new knowledge without
negatively interfering with previously learned knowledge. A major challenge
in achieving this goal is to overcome catastrophic forgetting, where the model
forgets knowledge learned from previous tasks while learning a new task [8, 16].
This is especially relevant in constantly changing environments, like automated
driving, in which a model for semantic scene parsing has to adapt to new unseen
objects, e.g., e-scooters, to different driving situations or to new environments,
e.g., different countries or adverse weather conditions. Continual learning is
a rapidly evolving field, aiming to overcome the limitations of catastrophic
forgetting. Continual learning algorithms often attempt to overcome specific
known causes of catastrophic forgetting like weight drift, activation drift,
inter-task confusion and the task-recency bias [15, 13]. However, the performance of
a continual learning algorithm is mostly measured using accuracy on the test set
of the tasks within a sequence. While test set accuracy is useful for comparing
different continual learning algorithms on their respective benchmarks, it
cannot provide insights into how and where the model is affected by catastrophic
forgetting, as it only provides final accuracy metrics. Therefore, the goal of
this work is to demonstrate how causes and effects of catastrophic forgetting can 
be revealed with existing methods that are used to measure the representational 
similarity, weight distance and the inter-task confusion of a continually trained 
model. Additionally, we identify the limitations of the approaches that should 
be taken into account when using them. 


2 Related Work 


As the previously mentioned underlying effects of catastrophic forgetting cannot
be measured exclusively by the accuracy achieved on the test sets, several
methods were proposed to gain additional insights into the causes and effects of
catastrophic forgetting. Mirzadeh et al. use Linear Mode Connectivity to
show that multi-task minima are connected to continual learning minima by a
linear path of low error on their respective tasks, while the individual single-task
optima are not similarly connected. More recent work uses Linear Probing
to investigate representational forgetting, which measures the difference in
accuracy an optimal linear classifier achieves before and after introducing a new
task. They confirm the notion that the observed test accuracy of continual
learning algorithms allows only limited insights into the model and that some
methods perform better at mitigating the effects of forgetting than the test accuracy
indicates. Similarly, for semantic segmentation Decoder Re-Training is used to
gain comparable insights and to measure the impact of inter-task confusion in
class-incremental learning [10]. Centered Kernel Alignment (CKA), introduced
by Kornblith et al., measures the representational similarity of neural architectures
after training. Recently it has also been applied in continual learning to measure
the shift of representations for previous tasks after training on a new task [5, 20].
Furthermore, the Dr. Frankenstein toolset proposed by Csiszárik et al. [4], which
measures the functional similarity of representations, was used to identify the
causes of forgetting in class-incremental semantic segmentation [10]. Finally,
ongoing research proposes several interpretability methods for deep learning
models that help to explain why a model made a particular prediction. These
methods can also be utilized in the setting of continual learning, but are not
discussed in our work. As there is an apparent lack of work comparing the
different methods, in this work we aim to evaluate the given methods on similar
tasks to understand how they can complement each other.


3 Preliminaries 


3.1 Effects of Catastrophic Forgetting 


A machine learning task $T = \{(x_m, y_m)\}_{m=1}^{M}$ consists of a set of M inputs
$x \in X$ and corresponding labels $y \in Y$. In classical machine learning, the
parameters $\theta$ of a model $f_\theta$ are optimized by following the negative gradient $g$
of the empirical risk over T w.r.t. a loss function $L$. Most commonly, $g$ is
approximated by calculating the stochastic gradient $\hat{g}$ on a mini-batch $T' \subset T$
with $p_i$ as the sampling probability for a training sample:


$\mathbb{E}[\hat{g}] = \sum_i p_i \nabla_\theta L(f_\theta(x_i), y_i)$    (3.1)


If $p_i$ is uniform for all training samples, the expectation of $\hat{g}$ is equal to $g$.
However, in continual learning $p_i$ is not uniform, as the model f is sequentially
optimized on a sequence of tasks $T_{1..K}$, so that $p_i$ is only uniformly distributed
over samples belonging to the current task $T_k$. However, the goal is to minimize
the empirical risk over the entire sequence $T_{1..K}$. Thus, when optimizing on
data of task $T_k$, the optimization disregards the other task distributions and the model
is optimized without regard to previous data. Therefore, the sampling probability
for all samples of the tasks $\{T_t \mid t \neq k\}$ is $p_i = 0$, which leads to catastrophic
forgetting.
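
The effect of the sampling probabilities can be made concrete with a toy calculation. The sketch below is purely illustrative: it assumes a one-parameter model with squared loss and two hypothetical tasks whose labels cluster around 1 and 10.

```python
def grad(theta, y):
    """Gradient of the squared loss 0.5 * (theta - y)**2 w.r.t. theta."""
    return theta - y

def expected_gradient(theta, labels, probs):
    """Expected stochastic gradient: sum_i p_i * grad_i, as in Eq. (3.1)."""
    return sum(p * grad(theta, y) for y, p in zip(labels, probs))

task_1 = [0.0, 1.0, 2.0]     # hypothetical old task
task_2 = [9.0, 10.0, 11.0]   # hypothetical current task
labels = task_1 + task_2
theta = 1.0                  # parameter after training on task_1 only

# Joint training: every sample is drawn with the same probability.
uniform = [1.0 / len(labels)] * len(labels)
# Continual training on task_2: samples of task_1 have probability zero.
only_task_2 = [0.0] * len(task_1) + [1.0 / len(task_2)] * len(task_2)

g_joint = expected_gradient(theta, labels, uniform)
g_continual = expected_gradient(theta, labels, only_task_2)
```

With uniform sampling the expected gradient points towards the joint optimum at 5.5, whereas with zero probability on the first task it points towards the optimum of the second task at 10, so the solution for the first task is left behind.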
Catastrophic forgetting manifests itself in four main effects in
incremental learning [15]:


• Weight Drift: During optimization on $T_k$, the weights of the model that
were relevant to the previous task $T_{k-1}$ are updated without regard to the
previous task, resulting in a drop of performance on task $T_{k-1}$.


• Activation Drift: A change of the weights of the model directly results
in a change of the internal activations and of the output of the model. While
activation drift is a direct result of the weight drift, activation drift
additionally takes the input data distribution into account.


• Inter-task confusion: The objective in class-incremental learning is to
correctly discriminate between all the observed classes. However, as the
classes are never jointly trained, the learned features are not optimized to
discriminate classes from different tasks, as shown in Figure 3.1. Related
to inter-task confusion are task-specific spurious features that can also
arise in domain-incremental learning [13].


• Task-recency bias: In the class-incremental setting, the model is optimized
to predict new classes without regarding the old classes. This leads
to a strong bias for the most recently learned classes, especially in the
classification layers of the models.


Figure 3.1: Visualization of task confusion in class-incremental learning. As classes of Task 1 (car
and bicycle) and classes of Task 2 (bus and motorcycle) are never trained at the same time, the
classifier never learns to discriminate between bicycles and motorcycles, which causes inter-task
confusion.


3.2 Notation 


A training task $T = \{(x_m, y_m)\}_{m=1}^{M}$ consists of a set of M images $x \in X$ with
$X = \mathbb{R}^{H \times W \times 3}$ and corresponding labels $y \in Y$. Given the task T, an artificial
neural network learns a mapping function $f: X \to Y$ that maps the input space
to the output space. The neural network consists of N consecutive layers
so that $f = g_N \circ \ldots \circ g_1$, where $g_n: A_{n-1} \to A_n$ are mappings between the
activation spaces $A_{n-1}$ and $A_n$ with $A_0 = X$. In continual learning, f is not
trained on a single task T but on a sequence of tasks $T_t$. We denote the neural
network that was successively trained on t tasks as $f_t = g_{t,N} \circ \ldots \circ g_{t,1}$ with
corresponding activations $A_{t,n}$. Our goal is to evaluate methods that measure
the activation drift between $A_{t,n}$ and $A_{t-1,n}$ or the weight drift between the
parameters of $g_{t,n}$ and $g_{t-1,n}$, which layer n is subjected to during continual
learning. Furthermore, we define two incremental settings that define how
subsequent tasks expand the first learned task:


• Class-incremental learning, in which each new task extends the existing
set of classes by a set of novel classes.


• Domain-incremental learning, in which the classes stay the same, but
the images of each task are obtained from a different distribution and
therefore have a distinct visual appearance.


4 Experiments


Our study compares different methods to measure catastrophic forgetting in
a class- and domain-incremental setup, because the effects of catastrophic
forgetting differ vastly between these setups. In the domain-incremental setting,
we train the model incrementally on Cityscapes [3] (CS) and then on the
ACDC-Night [22] subset. ACDC and CS are both large-scale datasets for semantic
understanding of urban street scenes for autonomous driving and share a common
19-class labeling policy, so that the increment is purely the change from day
(CS) to night images (ACDC). For the class-incremental setting, we use the
commonly used PascalVOC [7] dataset with a 15-5 split. The PascalVOC-15-5
split is a two-step incremental learning sequence, which consists of learning 15
classes (1-15) in the first step $T_0$ and the remaining 5 classes (16-20) in the
second step $T_1$. We evaluate all methods on the ERFNet architecture and
compare different methods to train the models in a continual manner, namely:
Fine-Tuning (FT), the prior-regularization method EWC and Replay.


5 Activation Drift 


Methods in this section measure the activation drift between a model $f_0$ and $f_1$,
where the model $f_0$ is trained on $T_0$ and $f_1$ is initialized with the parameters of
$f_0$ and incrementally trained on $T_1$. This is done in a layer-wise manner, so that
the activations $A_{0,n}$ and $A_{1,n}$ of layer n of the models $f_0$ and $f_1$ are compared.
The current key methods to measure activation drift in neural networks are
Centered Kernel Alignment (CKA) and Layer Matching [10, 4]. In this
section we discuss how these methods measure activation drift in continual
learning and how they differ.


Method        Class-Incremental              Domain-Incremental
              0-15    15-21   Forgetting     Cityscapes   Night   Forgetting
Fine-Tuning   4.6     23.0    9.0            36.4         39.1    31.9
EWC           28.1    10.1    23.8           40.2         27.8    28.1
Replay        42.2    29.1    39.1           58.2         40.4    10.1


Table 4.1: Comparison of EWC, Replay and Fine-Tuning in the class- and domain-incremental
learning scenarios. Evaluation is run after training on the entire task sequence.


5.1 Layer Matching with Dr. Frankenstein 


The Dr. Frankenstein toolset aims to analyze the similarity of representations
in deep neural networks by matching the activations of two networks on
a given layer, joining them with a stitching layer [4]. The goal of the
stitching layer is to transform the activations of a specific layer of $f_0$ to the
corresponding activations of a model $f_1$. We measure the similarity of the
learned representations by comparing the initial accuracy of the model $f_0$ with
that of the resulting Frankenstein network. The higher the resulting relative
accuracy is, the closer the learned representations of the models are to each
other. Previous work in continual learning omits the stitching layer and directly
uses the activations of $f_0$ in $f_1$, as the models are closely related, because $f_1$ is
initialized with the parameters of $f_0$ [10]. The setups are displayed in Figure 5.1.
If the accuracy of the resulting Frankenstein network is not adversely affected,
this is clear evidence that the internal representations of $f_1$ were not altered
drastically during training on $T_1$. This analysis gives insights into how much
the activation at a specific layer has changed after incremental training.
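
The variant without a stitching layer can be sketched with models represented as plain lists of layer callables. This is a toy illustration with hypothetical names; which model supplies the front layers is left as a parameter, and the exact-match `accuracy` helper is a simplification of the mIoU used in the experiments.

```python
def run_layers(layers, x):
    """Apply the layer callables g_1, ..., g_N in order."""
    for layer in layers:
        x = layer(x)
    return x

def frankenstein(front, back, n):
    """Layers 1..n from `front`, layers n+1..N from `back`, no stitching layer."""
    if len(front) != len(back):
        raise ValueError("both models need the same number of layers")
    return front[:n] + back[n:]

def accuracy(layers, inputs, labels):
    """Fraction of inputs for which the network output equals the label."""
    hits = sum(run_layers(layers, x) == y for x, y in zip(inputs, labels))
    return hits / len(inputs)

def relative_accuracy(front, back, n, reference, inputs, labels):
    """Accuracy of the stitched network relative to the reference model."""
    stitched = frankenstein(front, back, n)
    return accuracy(stitched, inputs, labels) / accuracy(reference, inputs, labels)

# Toy models: f_1 shifted its first layer and compensated in the second one.
f_0 = [lambda x: x, lambda x: x > 0]
f_1 = [lambda x: x - 2, lambda x: x > -2]
```

In this toy case `f_1` on its own still classifies the old inputs correctly, but feeding its drifted first-layer activations into the second layer of `f_0` lowers the relative accuracy, which is the signal that layer stitching reads as activation drift.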


Results: The layer-wise activation drift measured with layer stitching for the
incremental learning scenarios is displayed in Figure 5.2. It is apparent that in the
class-incremental scenario (Pascal-15-5) the encoder layers up until layer 8 are
not at all affected by activation drift and only later encoder layers or specific
layers in the decoder show significant representation shift, as already pointed
out in recent work. However, in the domain-incremental learning
setting we see that primarily the first layers are affected by activation drift and
later layers only change slightly.


Limitations: A limitation of this approach is that it cannot measure positive 
backward transfer without the additional stitching layer, in which the model 
would learn a new improved representation for old data while learning a new 


Tobias Kalb 


[Figure panels, left and right: Model $f_0$, trained on $T_0$; Model $f_1$, initialized as $f_0$ and then trained on $T_1$]


Figure 5.1: Comparing the original Dr. Frankenstein layer matching [4] (left) with the approach
without additional stitching layer (right).


Figure 5.2: Activation drift between $f_1$ and $f_0$ measured by relative mIoU on the first task of the
Frankenstein networks stitched together at specific layers (horizontal axis). The layers of the encoder
are layers 0-16 (grey area), the decoder layers are 17-20 (white area). In the class-incremental
Pascal-15-5 setting the activations of the early layers of the encoder stay very stable for all methods;
only EWC and Fine-Tuning have a severe drift in activations in the decoder layers of the network. In
the domain-incremental learning setting only the first layers (0-8) are affected by activation drift
and layers 8-20 change only slightly.


task. This could occur when a feature that was discriminative for $T_0$ is replaced
with a feature that is more useful for discriminating all classes. In that case $f_0$
could no longer extract useful information from the stitched representations of
$f_1$, which would lead to a performance drop although these representations are
still useful for $f_1$ to classify all classes.


5.2 Centered Kernel Alignment (CKA)


CKA is a similarity index that measures the similarity between internal
representations of neural networks. Given |T| samples and their corresponding
matrices of activations $A_n \in \mathbb{R}^{|T| \times p}$ and $B_n \in \mathbb{R}^{|T| \times p}$ of p neurons at a specific
layer n of two neural networks $f_A$ and $f_B$, linear CKA is defined as:


$\mathrm{CKA}(A_n, B_n) = \frac{\lVert A_n^T B_n \rVert_F^2}{\lVert B_n^T B_n \rVert_F \, \lVert A_n^T A_n \rVert_F}$    (5.1)


$\lVert \cdot \rVert_F$ denotes the Frobenius norm. Linear CKA has recently been used to
compare the intermediate representations of models $f_0$ and $f_1$ in continual
learning [5, 20], in which a high CKA score equates to lower representational
forgetting. Csiszárik et al. [4] investigated the relationship between
representational similarity that is measured by CKA and functional similarity measured by
Dr. Frankenstein. In this case functional similarity means that the representations
lead to similar outputs of the model, whereas representational similarity
directly measures the distance between representations. They demonstrate that
a network can retain high functional similarity using Dr. Frankenstein while
simultaneously decreasing the similarity index measured by CKA. In other
words, they can change the representations of a layer while the output of the
entire network is not affected.
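
For small activation matrices, linear CKA can be computed in a few lines. The sketch below uses plain Python lists (rows are samples, columns are neurons) and centers every column before evaluating Eq. (5.1), as is standard for linear CKA; the function names are illustrative and not taken from any particular library.

```python
import math

def center_columns(m):
    """Subtract the column mean from every entry (rows are samples)."""
    rows = len(m)
    means = [sum(col) / rows for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b, summed over all column pairs."""
    return sum(
        sum(x * y for x, y in zip(col_a, col_b)) ** 2
        for col_a in zip(*a)
        for col_b in zip(*b)
    )

def linear_cka(a, b):
    """Linear CKA between two activation matrices with the same number of rows."""
    a, b = center_columns(a), center_columns(b)
    numerator = cross_gram_sq_norm(a, b)
    denominator = (math.sqrt(cross_gram_sq_norm(a, a))
                   * math.sqrt(cross_gram_sq_norm(b, b)))
    return numerator / denominator
```

The index is 1 for identical representations and is unchanged by rotating or uniformly rescaling one of them, which is one reason why it measures representational rather than functional similarity.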


Results: In Figure 5.3 we compare layer stitching results with the similarity
measured by CKA. For the encoder layers we observe very similar results, with
only a mild shift in representations. However, in the decoder layers we
see vast differences, as EWC retains much higher representational similarity
than Replay or FT. This seemingly contradicts the results in Table 4.1, where
EWC shows significantly more forgetting than Replay. In combination with the
layer stitching plots this could also indicate that small representational changes
in the decoder lead to significant functional changes in the output.


Limitations: Similar to layer stitching, CKA is also not able to measure
positive backward transfer, so that one needs to be aware that not every drift
in activation has an adverse effect on the performance of the previous task. The
results show that higher representational similarity does not directly indicate
better performance on the previous task.


Figure 5.3: Activation drift between $f_1$ and $f_0$ measured by CKA (right) and layer stitching
(left) at specific layers (horizontal axis).


6 Re-Training and Re-Estimation 


Re-Training and Re-Estimation methods freeze specific layers of the
network and re-train the remaining layers on all datasets, to show how useful the
features of the frozen layers are to solve the joint task. Decoder Re-Training and
Linear Probing freeze the backbone of the model and re-train the classification
layer or the decoder of the network, to estimate how discriminative the features of
the backbone are. Partial Retraining Accuracy, on the other hand, measures
forgetting for single layers of the network while the remaining layers are trained
from scratch. Finally, Batch Normalization Re-Estimation is used to measure
the contribution of the changing Batch Normalization population statistics to
catastrophic forgetting.


6.1 Batch Normalization Re-Estimation


A major contributor to the activation drift of a model trained incrementally are
the changing population mean and variance of the Batch Normalization (BN)
layers, which are collected during training to achieve deterministic behavior
at inference. While this works for i.i.d.¹ data, in the non-i.i.d. incremental
learning setting the BN estimates of the population mean and variance are
heavily biased towards the most recent task, leading to a significant drop in
accuracy on old tasks [14]. A straightforward method to measure the impact of
changing BN statistics is to re-estimate them on the joint dataset. This can be
achieved by simply doing a forward pass over the entire joint dataset, without
the backward pass.
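
The idea can be illustrated for a single BN channel with a toy calculation. The sketch assumes that the population statistics are a plain average over all activations seen in the forward passes; the activation values and function names are hypothetical.

```python
def population_stats(batches):
    """Mean and (biased) variance over every activation seen in the forward passes."""
    values = [v for batch in batches for v in batch]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def normalize(x, mean, var, eps=1e-5):
    """BN-style normalization at inference time (affine parameters omitted)."""
    return (x - mean) / (var + eps) ** 0.5

old_task = [[-1.0, 1.0]]   # toy activations of one channel on the old task
new_task = [[3.0, 5.0]]    # the same channel on the most recent task

# Statistics collected only on the most recent task: (4.0, 1.0)
biased = population_stats(new_task)
# Statistics re-estimated by forward passes over the joint data: (2.0, 5.0)
re_estimated = population_stats(old_task + new_task)
```

An old-task activation of 1.0 is mapped to about -3 under the biased statistics but to about -0.45 after re-estimation, so the following layers receive inputs much closer to the range they saw during training on the old task.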


Results: Table 6.1 shows the respective re-estimation results for the domain-
and class-incremental experiments. By comparing the $\Delta mIoU_{BN}$ we see
that the changing BN statistics have a much more significant impact in
domain-incremental learning. Furthermore, in the domain-incremental setting Replay
alleviates the change of BN statistics completely, as re-estimation even slightly
decreases the mIoU. Therefore, we conclude that changing BN statistics are a
significant contributor to forgetting in the domain-incremental setting and that
BN re-estimation can be an important tool to reveal this effect.


Limitations: BN re-estimation can only give a measure of which BN layers
are affected by the changing BN population statistics, but allows no insights into
the direct causes of the change. However, it can be vital to understand how a
continual learning algorithm is affecting the BN statistics, e.g., Replay stabilizes
population statistics using the Replay buffer. Finally, it should be noted that this
method is not applicable to the recent Vision Transformer [6] architectures, as
they use Layer Normalization [1] instead of BN.


1 independent and identically distributed


Method        Class-Incremental        Domain-Incremental
              mIoU_BN    ΔmIoU_BN      mIoU_BN    ΔmIoU_BN
Fine-Tuning   44         0.1           47.0       10.6
EWC           36.7       4.1           57.6       17.4
Replay        46.3       0.8           57.9       -0.3


Table 6.1: Performance in mIoU [%] of the adapted model $f_1$ after re-estimating the population
statistics of all BN layers. By measuring and comparing the increase after re-estimating BN statistics
($\Delta mIoU_{BN}$), we see that in class-incremental learning re-estimating BN statistics leads to a less
significant increase compared to the domain-incremental setting.


6.2 Partial Retraining Accuracy (PRA) 


Murata, Toyota, and Ohara measure representational forgetting of a specific
layer $g_{t,n}$ with Partial Retrain Accuracy (PRA), which is the accuracy that can be
gained after freezing $g_{t,n}$ while the remaining part of the model is re-initialized
and re-trained on all data from the previous tasks. After that they re-order the
sequence in which the tasks are learned, to prevent the effect the task order has
on the learned representations. Using this method they show that a non-negligible
amount of forgetting already happens at shallow layers.


Limitations: The validity of this method is questionable, because the majority
of the network is re-trained on the joint data, so that activation drift
of intermediate layers can potentially be rectified by the following layers as they
are trained on the joint task. For example, when freezing only the very first block of
a network in the domain-incremental setting, the remaining layers will amend
the activation drift of the first layer, although the very first layers are known
to be causes of severe forgetting in this setting. So while the aforementioned
approaches that directly compare the activations are not able to distinguish whether
the model has learned a new representation for old data or whether the previous
representation has been overwritten, PRA can falsely lead to the conclusion that
a new representation has been learned due to re-training.
6.3 Decoder Retrain Accuracy and Linear Probing


Decoder Retraining and Linear Probing [5] aim to measure representational
forgetting by calculating the difference in accuracy an optimal classifier layer
achieves on an old task before and after introducing a new task. Since the methods
are similar except that Decoder Retraining is intended for semantic segmentation
and Linear Probing for classification, we only consider Decoder Retraining in
this section. To measure the Decoder Retraining Accuracy, the encoder of the
model is frozen while the decoder is re-trained on all classes with the same
training configuration and subsequently evaluated on all tasks. While the gain
in mIoU ($\Delta\mathrm{mIoU}$) gives a measure of how much the decoder is contributing
to forgetting, the resulting mIoU shows how useful the learned representations are to
discriminate between classes of different tasks. Linear Probing and Decoder
Retraining have both been used to show that continual learning methods that
do not seem effective in the class-incremental setting, such as EWC, are in fact able
to stabilize internal representations and that only a few final layers are the main
contributors to the deteriorating performance on the old task [10] [5].


Limitations: Decoder Retrain Accuracy and Linear Probing are aimed at
differentiating between representational forgetting in the encoder and forgetting
in the decoder. They indicate how discriminative the features of the
backbone are to distinguish all observed classes. However, they cannot give
further insights into which layers are affected. Furthermore, they are not as useful in
the domain-incremental setting, because there the forgetting mainly affects early
layers.


7 Weight Drift 


Instead of measuring activation drift, it is also possible to measure the change
of the model from $f_0$ to $f_1$ simply by calculating the $\ell_2$ distance between the models'
normalized parameters $\theta_0$ and $\theta_1$ [19]. Furthermore, to see how individual
layers are affected by weight drift, it is also possible to measure the distance
between specific convolutional layers or BN layers.
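The distance computation described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used for the reported numbers: the parameters are assumed to be available as flat lists of floats, the normalization is assumed to divide each parameter vector by its own $\ell_2$ norm, and the names `weight_distance` and `layerwise_distance` are hypothetical.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def normalized(vector):
    """Scale a vector to unit l2 norm (assumed normalization scheme)."""
    norm = l2_norm(vector)
    return [x / norm for x in vector] if norm > 0 else list(vector)


def weight_distance(theta_0, theta_1):
    """l2 distance between the normalized parameter vectors of f_0 and f_1."""
    a, b = normalized(theta_0), normalized(theta_1)
    return l2_norm([x - y for x, y in zip(a, b)])


def layerwise_distance(model_0, model_1):
    """Per-layer distances; both models map layer names to flat parameter lists."""
    return {name: weight_distance(model_0[name], model_1[name]) for name in model_0}
```

Under this assumed normalization, a pure rescaling of a layer yields a distance of zero, so the measure reacts only to changes in the direction of the parameter vector.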
Method      | Class-Incremental | Domain-Incremental
            | $\ell_2$ distance | $\ell_2$ distance
Fine-Tuning | 0.1               | 0.2
EWC         | 0.02              | 0.11
Replay      | 0.14              | 0.25


Table 7.1: Weight distance calculated as $\ell_2$ distance of the models' parameters $\theta_0$ to $\theta_1$. The
distance between the models' parameters is lowest for EWC, as it explicitly constrains updates on
existing parameters. Weight distance is largest for Replay, which is least affected by a drop in mIoU,
indicating that weight distance does not always correlate with a drop in performance.


Results: In Table 7.1, we display the $\ell_2$ distance of the models' parameters for
the class- and domain-incremental setting. We notice that the models trained
with EWC stay the closest to $\theta_0$, as EWC explicitly constrains updates on existing
parameters. Although Replay is least affected by catastrophic forgetting, we
observe the largest $\ell_2$ distance between the models' weights for it. This indicates that
more weight drift does not directly indicate more forgetting. We additionally
compare the layer-wise distances of the convolutional layers in Figure 7.1.
Interestingly, the model trained with EWC shows only very minor changes in
the weights up until the later layers in the decoder of the network,
which coincides with the layers in which we also observe a significant drop in
similarity for layer stitching. Finally, we also compare the $\ell_2$ distance of the BN
layers in Figure 7.1 for the domain-incremental setting, in which we see that the
very first BN layer undergoes the most drastic changes.


Limitations: The major difference between measuring weight drift and
activation drift is that weight drift does not take the training data into account.
However, we observe that the distance between the parameters of the models is not indicative
of the performance drop on the previous task. Therefore, we conclude that it
can be used to interpret how the weights have changed, but it should not be
understood as a direct measure of catastrophic forgetting.
Figure 7.1: The distance of $f_1$ and $f_0$ measured by the $\ell_2$ distance of the weights of the convolutional
layers in the class-incremental setting (left) and of the running mean of the BN layers in the
domain-incremental setting (right).


Class-Incremental

Method      | Forgetting | CKA  | Layer Stitching | Weight Distance | $\Delta\mathrm{mIoU}$ (Decoder Retraining) | $\Delta\mathrm{mIoU}_{BN}$
Fine-Tuning | 9.0        | 45.0 | 15.9            | 0.1             | 18.9                                       | 0.1
EWC         | 23.8       | 88.5 | 60.4            | 0.02            | 11.3                                       | 4.1
Replay      | 39.1       | 76.2 | 80.2            | 0.14            | 3.8                                        | 0.8

Domain-Incremental

Method      | Forgetting | CKA  | Layer Stitching | Weight Distance | $\Delta\mathrm{mIoU}_{BN}$
Fine-Tuning | 31.9       | 78.4 | 54.7            | 0.2             | 10.6
EWC         | 28.1       | 78.3 | 57.6            | 0.11            | 17.4
Replay      | 10.1       | 94.8 | 83.8            | 0.25            | -0.3


Table 7.2: Comparison of the discussed methods to measure catastrophic forgetting. As CKA and
layer matching measure similarity of activations for every layer, we report only the minimum value.
Values in bold indicate the method that supposedly is least affected by forgetting according to the
measure used. The results clearly indicate that catastrophic forgetting is nuanced, as many effects
contribute to forgetting in a different manner, so that there is no single measure that can show the
full picture.
8 Conclusion


In this report we evaluated and discussed tools to assess the effects of catas- 
trophic forgetting. In a series of experiments, we demonstrate the strengths 
and weaknesses of these tools. We find that these approaches work best in 
combination since they complement each other and capture different effects. For 
example, measuring activation drift with CKA or layer stitching is helpful to 
locate forgetting, but BN reestimation and Decoder Re-Training are required 
to identify the causes. Furthermore, we found that evaluating weight distances 
does not correlate with the drop in performance of previous tasks and should 
not be interpreted as a measure of catastrophic forgetting. Finally, we note 
that measures of activation drift such as layer matching and CKA are useful 
in both domain- and class-incremental settings, whereas BN Re-Estimation 
is more insightful in domain-incremental learning and Decoder Re-Training
in class-incremental learning. A summary of the results is also displayed in
Table 7.2. Our report reveals that catastrophic forgetting is nuanced, as many
effects contribute to it in different ways, so that there is
no single measure that can show the full picture.
Abstract


The discovery of causal relations via interventions has proven to be simple when
only one observed variable is affected or unaffected. However, in a multivariate
setting, it is likely that more than one variable is affected by the intervention.
Thus, drawing conclusions about the true causal graph becomes far more difficult,
as we cannot retrieve information about any obvious causal relationship or causal
order. We demonstrate that causal discovery with multiple affected variables is
possible by introducing a novel definition of path constraints for constraint-based
causal discovery. We exercise our novel technique on a combustion engine
simulation, where we inject wavelets of our choice into a variable under investigation
and try to rediscover these wavelets in the other observed variables to gain such
path constraints and thus to restrict the causal graph search.
1 Introduction


Causal discovery deals with finding cause and effect relationships in data. Its 
subdomain of interventional causal discovery tries to achieve this knowledge 
gain by performing experiments [5]. 


Current methods for interventional causal discovery either inspect the
interventional effect in at most one variable and thus recover little causal information, or
cannot handle interventional effects in multiple variables. We want to investigate
a specific approach that is minimally intrusive: it injects a signal into a running
process and gathers information about its occurrence in the causal graph to
deduce information about causal relations. We will demonstrate an example
experiment on a combustion engine simulation. The paper is structured as
follows: In Section 2, we shed some light on existing approaches for interventional
structure learning. In Section 3, we introduce the fundamentals of our novel
low-invasive technique. In Section 4, the signal injections are demonstrated step by step on a
combustion engine example. In Section 5, we draw the conclusion.


2 Causal Graphs 


Causal graphs consist of a set of nodes representing variables $V$ and a set of
edges $E$ representing causal relations. If a directed edge points from $A \in V$
to $B \in V$, then variable $B$ is caused by $A$. A path from $A$ to $B$ is a chain
of consistently directed edges $C(A, B) = \{A \to X_1, \dots, X_i \to B\}$, $X_i \in V$,
$i \in \mathbb{N}$, directed from $A$ to $B$, with the number of edges being equal to or greater
than one, $|C(A, B)| \geq 1$. A direct causal relation between variables indicates
$|C(A, B)| = 1$, whereas an indirect causal relation indicates a path with $|C(A, B)| > 1$.


The goal of causal discovery is to gain knowledge about the true causal graph
of the inspected environment. According to the theory of constraint-based
learning [6], all potential graphs form an equivalence class with respect to our
knowledge about the edges and variables of the graph itself. If the number of
graphs in the equivalence class equals one, we assume to have found the true
causal graph. However, if no knowledge of the causal graph is given, the number of
edges that have to be inspected grows with the number of inspected variables. Table 2.1
shows how the number of graphs in an equivalence class grows exponentially
with the number of variables and edges. One can gain these constraints either by
inspecting data or, as in our case, by performing experiments.
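The growth reported in Table 2.1 can be reproduced with a short calculation. The sketch below rests on an assumption that is not stated explicitly in the text: each unordered pair of variables can be in one of four states (no edge, an edge in either direction, or edges in both directions), which matches the tabulated counts for two to six variables. For a single variable this formula yields one (empty) graph, whereas the table lists zero. The function names are hypothetical.

```python
def potential_edges(n_variables):
    """Number of unordered variable pairs, i.e. potential edge positions."""
    return n_variables * (n_variables - 1) // 2


def potential_graphs(n_variables, states_per_pair=4):
    """Number of candidate graphs, assuming four possible states per pair."""
    return states_per_pair ** potential_edges(n_variables)


for n in range(2, 7):
    print(n, potential_edges(n), potential_graphs(n))
```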


2.0.1 Interventional Causal Discovery 


Using interventions for causal discovery is one of the oldest and most popular
approaches in science, despite their costliness, their potential of being
unethical, or their simply not being feasible. It is assumed that a variable A is a cause
of another variable B if an intervention on A also affects the associated variable
B [8]. The existing approaches have been divided into two major categories called
structural interventions and parametric interventions.


2.0.2 Structural Interventions 


As shown in Figure 2.1, structural interventions (also called hard interventions)
cut off all causal influences on the variable under intervention and determine
its probability distribution completely. For example, in randomized controlled
drug trails Bl. the treatment drug a patient receives is determined randomly 
but always one of several options. Here, as stated by the potential outcomes 
framework, the causal effect is identified by structurally intervening on the one 
variable while observing the effect in the other. 


Parametric Interventions: Parametric interventions (also called soft
interventions) intervene on the probability distribution of a variable by adding another
cause to it or its causes. They do not disturb the original causal structure, but


Count of Variables        | 1 | 2 | 3  | 4     | 5         | 6
Count of Potential Edges  | 0 | 1 | 3  | 6     | 10        | 15
Count of Potential Graphs | 0 | 4 | 64 | 4,096 | 1,048,576 | 1,073,741,824


Table 2.1: Overview of the exponential increase in graphs with growing number of edges and
variables
(a) Original Graph (b) Structural Intervention (c) Parametric Intervention


Figure 2.1: Causal graphs after a structural and a parametric intervention on variable B 


no foreign influences of other variables can be prevented with certainty. In 
comparison, parametric interventions are a rather new technique. 


3 Essentials of Wavelet Injections 


The new intervention method can be classified, according to Section 2, as a soft
intervention, since the original causal network is not perturbed but an
additional variable is added as a cause to the inspected variable. The particularity
of these interventions lies in the fact that a signal in the form of a wavelet is
added to the variable, which we then try to rediscover in the other variables to gain causal information.


When injecting a wavelet into a variable $A$, the injected wavelet and the timeseries
coming from the causing variable are added up. We assume the wavelet to
spread in the graph in the direction of the causal relations. If we find the wavelet in
only one other variable $B$, we assume a direct causal relation to be present, but in case of a
discovery in several other variables, including $B$, we may not. Instead, we gain
knowledge about an existing path between the variables with $|C(A, B)| \geq 1$,
since the wavelet must have traveled somehow from $A$ to $B$. We will refer to
such path information as path constraints from here on. With each gained path
constraint, we can remove all graphs from the equivalence class of potential
graphs that have no path between the investigated variables and hence
do not support the constraint, so that the number of potential graphs decreases.
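The pruning step can be illustrated on a small example. The sketch below enumerates all candidate graphs over a handful of variables and keeps those that contain a directed path for every path constraint. It is an illustration under stated assumptions, not the authors' implementation: every variable pair is assumed to take one of four states (none, forward, backward, both), and the names `STATES`, `directed_edges`, `has_path` and `prune` are hypothetical.

```python
from itertools import product

# Assumed edge states per unordered variable pair (a, b):
# no edge, a -> b, b -> a, or both directions.
STATES = ("none", "forward", "backward", "both")


def directed_edges(variables, pair_states):
    """Translate one state per variable pair into a set of directed edges."""
    pairs = [(a, b) for i, a in enumerate(variables) for b in variables[i + 1:]]
    edges = set()
    for (a, b), state in zip(pairs, pair_states):
        if state in ("forward", "both"):
            edges.add((a, b))
        if state in ("backward", "both"):
            edges.add((b, a))
    return edges


def has_path(edges, source, target):
    """Depth-first search for a directed path with at least one edge."""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node and b not in seen:
                if b == target:
                    return True
                seen.add(b)
                stack.append(b)
    return False


def prune(variables, path_constraints):
    """Keep only the graphs that support every (source, target) path constraint."""
    n_pairs = len(variables) * (len(variables) - 1) // 2
    kept = []
    for pair_states in product(STATES, repeat=n_pairs):
        edges = directed_edges(variables, pair_states)
        if all(has_path(edges, s, t) for s, t in path_constraints):
            kept.append(edges)
    return kept


remaining = prune(["A", "B", "C"], [("A", "B")])
print(len(remaining))  # 40 of the 64 candidate graphs contain a path from A to B
```

For three variables, the single constraint C(A, B) removes the 24 graphs in which neither the edge A → B nor the chain A → C → B is present.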


For signal recovery, we normalized the measured values. Otherwise, the different
scales of the variables would make a comparison difficult. Then we applied
to each measured variable the fast pattern matching algorithm called Mueen's
ultra-fast Algorithm for Similarity Search (MASS) [9]. It matches a
desired pattern step by step against each subsequence of the inspected timeseries and calculates the
z-normalized distance. The aggregation of these distances results in an overall
distance profile. If its minimal distance is below a chosen threshold, we assume
the corresponding position to contain our signal. Otherwise, we assume the signal to be absent in
the observed variable and thus we gain no path constraint.
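The idea of a distance profile with a detection threshold can be sketched as follows. Note that this is a naive sliding-window computation of the z-normalized Euclidean distance, not MASS itself, which obtains the same profile much faster via the FFT. The handling of constant windows (mapped to an all-zero vector) and the function names are assumptions made for this illustration.

```python
import math


def z_normalize(window):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0:
        # Assumption: a constant window carries no shape information.
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]


def distance_profile(pattern, series):
    """z-normalized Euclidean distance of the pattern to every subsequence."""
    m = len(pattern)
    q = z_normalize(pattern)
    profile = []
    for start in range(len(series) - m + 1):
        w = z_normalize(series[start:start + m])
        profile.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w))))
    return profile


def find_signal(pattern, series, threshold):
    """Return the best matching position, or None if no distance is below the threshold."""
    profile = distance_profile(pattern, series)
    best = min(range(len(profile)), key=profile.__getitem__)
    return best if profile[best] < threshold else None


pattern = [0.0, 1.0, 0.0, -1.0, 0.0]
# The pattern is scaled and shifted before it is embedded at index 3.
series = [5.0, 5.0, 5.0] + [5.0 + 2.0 * v for v in pattern] + [5.0, 5.0]
print(find_signal(pattern, series, threshold=0.5))
```

Because both the pattern and each subsequence are z-normalized, a scaled and shifted copy of the pattern is still matched, which is why the variables do not need to share a common scale for this step.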


Note that, in general, we do not draw conclusions from variables in which
the injected signal could not be found, as the wavelet may be lost for various
reasons. For example, the signal may be of an unfortunate form and hence be
canceled out by the causal graph itself, it may be heavily deformed and thus
not be recoverable, or it may simply be too weak to be noticeable in other variables.


4 Applying Signal Injections 


Here we demonstrate how we applied the signal injections on a combustion 
engine dataset and explain the experiment step by step. 


Step 1: Wavelets for Injection 


We decided to use three very distinct and well-defined wavelets for our signal
injection in the combustion engine. These are a Daubechies 4 wavelet, a Mexican
hat wavelet and a Haar wavelet. They are depicted in Figure 4.1. We have
chosen these wavelets because they contain amplitudes in the positive and
negative value range and have a unique shape.


Step 2: Simulation Setup 


As a testing environment, we used a running combustion engine simulation
[1]. To evaluate the performance of the novel discovery approach, we inferred the
simulation's true causal graph as is shown in Figure 4.2. Here, we give a brief
explanation of the causal relations: The angle of the throttle plate influences
Figure 4.1: The wavelets we used for signal injection


how much air intake in the motor cylinder is possible. The air intake over time 
adds up to the aircharge in the cylinder before combustion. After combustion, 
depending on the aircharge, the torque increases and finally the overall engine 
speed rises. The increase in engine speed also depends on the load carried by 
the engine. 


We added multiple sensors to the simulation to retrieve the timeseries required 
for signal recovery. In total we took measurements of six variables including 
the actuated variable. We injected the signals in the actuation of the angle 
only and inspected all the other measured timeseries for traces of the injected 
signal. According to Figure [4.2] we expect to find the wavelets in the air intake, 
aircharge, torque and engine speed variable, but not in the load variable, as it is 




Constraint-based Causal Discovery by using Path Constraints 


Figure 4.2: The true causal graph of the combustion engine simulation


independent of the angle variable. The wavelet rediscovery method is required
to come to the same conclusion.


Step 3: Signal Discovery 


As an implementation of the pattern matching algorithm, we used the Python
package stumpy¹. With it, we found several wavelets in all variables depending
on the influenced angle variable. Table 4.1 gives an overview of the wavelets
we rediscovered. All in all, the Daubechies 4 wavelet and the Mexican hat
wavelet performed best, as they were found at their actual positions in all variables
depending on the angle variable. For evaluation purposes, we determined
the actual position in advance by comparing the measurements with wavelet
injections to measurements without injections. Any divergence between
those measurements must be caused by the wavelet. We decided to use this
information only for evaluation, as we want to gain causal information with
a minimum number of measurements.


¹ https://stumpy.readthedocs.io/
Table 4.1: An overview of the wavelets we injected into the angle variable and whether they could be
recovered in the other variables in their actual position.


Figure 4.3 presents an excerpt from our results for the aircharge and the load
variable for each of the three wavelets. The colored area is where the signal
was rediscovered by the pattern matching algorithm. It is colored green when
the wavelet is found in its actual position, and red when it was found in a
wrong position or in a variable where no wavelet influence is present. In the
aircharge variable, both the Daubechies 4 wavelet and the Mexican hat wavelet
were rediscovered in their actual position. Only the Haar wavelet was found in
the wrong position (Subfigure 4.3(e)).


In the load variable, no signal should be found at all, as the variable is not
influenced by the angle variable where we injected the wavelet. But the discovery
algorithm wrongly found the Haar wavelet in the load variable (Subfigure 4.3(f)).
We assume that the mistaken discoveries of the Haar wavelet are due to its
simple form: it fits in several places of a time series, even if it is not
present at all. Both the Mexican hat wavelet and the Daubechies 4 wavelet proved
complex enough to allow a safe rediscovery.
Figure 4.3: The wavelets as they were discovered in the exemplary aircharge and load variable. The
area where the lines diverge indicates the presence of a signal and is highlighted green as a mark 
for successful recovery. Red highlights indicate a wrong recovery. If the lines do not diverge, no 
wavelet is present and nothing should be discovered. 
Step 4: Causal Inference and Results


From the previous step, we were able to retrieve, independently from both the
Mexican hat and the Daubechies 4 wavelet injection, the four path constraints
{C(Angle, Air intake), C(Angle, Aircharge), C(Angle, Torque),
C(Angle, Engine Speed)}. Next, we implemented a brute-force algorithm that
generated all viable graphs for the given variables and eliminated all graphs from the
set that did not support the constraints. In total, we were able to reduce the number
of graphs from 1,073,741,824 to 23,855,104 and thus eliminated
approximately 97.8% of the potential causal graphs. The number of graphs may be
reduced further by performing additional wavelet injections.
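The reported reduction can be checked with a short calculation that uses only the two graph counts stated above; the variable names are chosen for this illustration.

```python
total_graphs = 1_073_741_824    # all candidate graphs for six variables (Table 2.1)
remaining_graphs = 23_855_104   # graphs left after applying the four path constraints

eliminated_fraction = 1 - remaining_graphs / total_graphs
print(f"{eliminated_fraction:.1%}")  # prints 97.8%
```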


5 Conclusion 


We explained the idea of discovering causal knowledge by injecting and retrieving
wavelets in causal variables. For injection, we simply added a chosen wavelet to
the incoming timeseries of a variable and tried to rediscover it in the dependent
variables via fast pattern matching. We gained causal information by defining
path constraints to restrict the equivalence class for the true causal graph, as
a path is assumed to be present between the injected variable and the variable of
rediscovery. We demonstrated the procedure on a running combustion engine
simulation by adding three different wavelets (Haar, Daubechies 4 and Mexican
hat) to an actuated variable. The procedure performed well for the Mexican hat
wavelet and the Daubechies 4 wavelet. By using either of them, we were able
to obtain four path constraints and to reduce with them the number of graphs
from 1,073,741,824 to 23,855,104.
Abstract


Many multi-person trackers follow the tracking-by-detection paradigm, applying
a person detector in each frame and linking detections of the same target
to form tracks in the association task. While the basic concept is the same
among these methods, various motion models, distance metrics to measure
the similarity of targets, and matching strategies are used. This makes it
difficult to compare different methods and also to assess the influence of single
tracking components on the final performance. For these reasons, all parts of the
association task are thoroughly investigated in this study. Starting with a simple
baseline, which is successively improved with the help of experimental results,
a strong tracking-by-detection-based framework is developed that achieves
state-of-the-art performance on two multi-person tracking benchmarks.


1 Introduction 


The objective of multi-person tracking (MPT) is to detect and identify all persons 
in every frame of a given video. Applications range from crowd monitoring to 
autonomous driving and surveillance-related tasks.
To solve the MPT problem, most methods pursue the tracking-by-detection
(TBD) paradigm. A detector is applied to each image independently and the
obtained detection sets are matched such that detections of the same target form
a track with a unique ID. This problem of assigning the correct detections to
the corresponding tracks is called the association task. While some approaches
try to integrate detection and association more tightly [2] [9] [43], the
strict separation of the two sub-tasks in TBD can still achieve state-of-the-art
results. Currently, the top performing entries of the standard MPT benchmarks
MOT17 and MOT20 [6] follow the TBD paradigm,
leveraging an off-the-shelf detection model and focusing on the association task.


Different strategies to improve the association can be observed in the literature.
Motion models based on the Kalman filter are used to make the estimated target
positions more accurate [4] [5] [10] [40]. In addition, camera motion compensation
techniques are integrated to deal with the motion of non-static cameras
[29]. The core of the association is the distance measure, which determines how
likely a detection belongs to a so-far tracked target. On the one hand,
motion-based metrics such as Intersection over Union (IoU) are utilized and, on the other
hand, the appearance of targets is leveraged. For example, in DeepSORT
and its further development StrongSORT [10], a person re-identification model
is applied to extract appearance features from the image patches of the detections 
and cosine distance between the high-dimensional features is taken as association 
metric. While DeepSORT uses motion distance only for gating, i.e., prohibiting 
unlikely assignments, StrongSORT combines it with appearance distance as 
also done in 117. Besides the distance metric, the association strategy has a 
large influence on the performance. While most methods make all assignments 
at once with the Hungarian algorithm [17], DeepSORT proposes a matching 
cascade that prefers previously observed targets and ByteTrack performs a
second matching step in which low-confidence detections are utilized.


In this study, all aforementioned components of the association task in MPT 
are analyzed in detail. Starting with a baseline TBD approach with strong 
motion models, a large number of experiments with different distance measures, 
both motion- and appearance-based, and their combinations are conducted. In 
addition, matching strategies with multiple stages are investigated. With the help 
of the experimental results, a strong TBD method is developed which achieves 
state-of-the-art results on the two MPT datasets MOT17 and MOT20 [6].
Furthermore, ablative experiments of the proposed framework are performed 
showing the influence of the single tracking components as well as the sensitivity 
of the tracking parameters on the final performance. 


2 Baseline and Motion Models 


A baseline tracker using only IoU as matching metric is built before more 
advanced matching measures and strategies are investigated. In addition, various 
motion models for target and camera motion are compared in this section. 


Let $\mathcal{T}^{t-1} = \{T_1^{t-1}, \dots, T_p^{t-1}\}$ be the tracks found until frame $I^{t-1}$ and
$\mathcal{D}^t = \{D_1^t, \dots, D_q^t\}$ the detections generated on the frame $I^t$ of a video
$V = [I^1, \dots, I^n]$ of length $n$. The association task is to assign the
detections $\mathcal{D}^t$ to their corresponding targets $\mathcal{T}^{t-1}$. For this, distances between all
confident detections and tracks are calculated and used as cost values.
Afterwards, the overall costs of assignments are minimized, e.g., with the Hungarian
algorithm [17]. More precisely, given a detection $D = (B_D, s) \in \mathcal{D}^t$ with box
$B_D$ and confidence $s$ and the box $B_T$ of a track $T \in \mathcal{T}^{t-1}$, a distance measure
$d$ can be calculated using the IoU between detection box $B_D$ and track box $B_T$:


$d_{IoU} = 1 - \mathrm{IoU}(B_D, B_T)$ (2.1)
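As a small illustration of Equation (2.1), the following sketch computes the IoU distance for two axis-aligned boxes. The box format `(x1, y1, x2, y2)` with top-left and bottom-right corners and the function names are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def iou_distance(detection_box, track_box):
    """Distance of Eq. (2.1): 0 for identical boxes, 1 for disjoint boxes."""
    return 1.0 - iou(detection_box, track_box)
```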


Before calculation of the distance matrix of detections and tracks, the detections
are filtered w.r.t. confidence, i.e., detections with a score $s$ smaller than the
threshold $s_{track}$ are removed and not used in the association. In addition, a
maximum distance $d_{max}$ is enforced to prohibit unlikely assignments. Unmatched
tracks that are not assigned a detection become inactive and are kept for $i_{max}$
frames in the set of tracks before deletion. Thus, they can be re-activated for
a short time period to bridge occlusions, for instance. Unmatched detections
with high confidence $s > s_{init}$ start new tracks. Note that some trackers
follow an initialization strategy in which detections first start tentative
tracks that have to be confirmed in subsequent frames in order to become active.
While this strategy suppresses frame-wise false positive detections, it introduces
false negatives since the tentative tracks do not contribute to the tracking output.
If the quality of detections is high and a large threshold $s_{init}$ is set, such an
initialization technique can reduce the overall performance, so it is not used in
this study unless otherwise stated.
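One association step with the thresholds described above might look as follows. This is a simplified sketch and not the tracker evaluated in this study: the optimal assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (feasible only for a few boxes), the gating with $d_{max}$ is assumed to be applied after the assignment, and the function name and return format are hypothetical.

```python
from itertools import permutations


def associate(distances, scores, s_track=0.6, s_init=0.7, d_max=0.8):
    """Match tracks to detections for one frame.

    distances[t][d] is the distance between track t and detection d,
    scores[d] is the confidence of detection d.
    Returns (matches, inactive_tracks, new_track_detections).
    """
    n_tracks = len(distances)
    # Only sufficiently confident detections take part in the association.
    candidates = [d for d, s in enumerate(scores) if s >= s_track]

    # Enumerate all one-to-one assignments of the smaller set to the larger one.
    if n_tracks <= len(candidates):
        assignments = (list(zip(range(n_tracks), p))
                       for p in permutations(candidates, n_tracks))
    else:
        assignments = (list(zip(p, candidates))
                       for p in permutations(range(n_tracks), len(candidates)))
    best = min(assignments, key=lambda pairs: sum(distances[t][d] for t, d in pairs))

    # Assumed gating: discard assigned pairs whose distance exceeds d_max.
    matches = sorted((t, d) for t, d in best if distances[t][d] <= d_max)
    matched_tracks = {t for t, _ in matches}
    matched_detections = {d for _, d in matches}

    inactive_tracks = [t for t in range(n_tracks) if t not in matched_tracks]
    new_tracks = [d for d in candidates
                  if d not in matched_detections and scores[d] > s_init]
    return matches, inactive_tracks, new_tracks


distances = [[1.0, 0.32, 1.0, 0.0],
             [0.32, 1.0, 1.0, 1.0]]
scores = [0.9, 0.8, 0.95, 0.3]
print(associate(distances, scores))
```

In this example, the fourth detection is ignored because of its low score, both tracks are matched, and the unmatched third detection is confident enough to start a new track.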


Most MPT approaches have in common that a Kalman filter (KF) is used
to model the motion of targets. However, various formulations of the state
vector $x$ and different implementation details can be found in the MPT literature.
The most used variants originate from the SORT [4] and DeepSORT
frameworks. The state vectors of the two KF types are as follows:


$x_{SORT} = (u, v, a, r, \dot{u}, \dot{v}, \dot{a})^T$ (2.2)

$x_{DeepSORT} = (u, v, r, h, \dot{u}, \dot{v}, \dot{r}, \dot{h})^T$ (2.3)


The box center position is $(u, v)$ and the aspect ratio is $r = w/h$ with $w$ and $h$
denoting box width and height, respectively. A derivative of a variable $x$ with
respect to time is indicated by $\dot{x}$. Whereas SORT explicitly models the box area
$a = w \cdot h$ and its derivative $\dot{a}$ but keeps the aspect ratio $r$ fixed, DeepSORT
instead models the box height $h$ and its derivative $\dot{h}$. Thus, the process and
measurement noise covariance matrices also differ, in addition to other implementation
details, which can be found in the papers or the public source code.


Recently, further developments have been proposed for the DeepSORT variant:
the Noise Scale Adaptive (NSA) KF and the height preservation (HP)
adaptation [30]. In the update step of the NSA KF, the measurement noise
covariance matrix $R$ is weighted with the confidence of the measurement, i.e.,
the detection confidence score $s$, as follows:


$R_{NSA} = (1 - s) \cdot R$ (2.4)


The higher the detection confidence, the smaller the adapted measurement noise
covariance $R_{NSA}$ and the more influence the detection has on the track state
update. The other adaptation is related to the state vector $x$. It is empirically
found in [30] that, when predicting inactive tracks for multiple frames without a state
update, the track box size can change dramatically, which hinders re-activation
after occlusion. To prevent this, HP can be applied by simply setting the derivative
$\dot{h}$ to zero before the KF prediction step, which is also done in [1] and [46].
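Both adaptations are small modifications of a standard Kalman filter, which the following sketch illustrates. It is restricted to the parts mentioned in the text and makes simplifying assumptions: the prediction uses a constant-velocity model with a unit time step and only propagates the state mean (covariance handling is omitted), and the function names are hypothetical.

```python
def nsa_measurement_noise(R, score):
    """Eq. (2.4): scale the measurement noise covariance R by (1 - s)."""
    return [[(1.0 - score) * value for value in row] for row in R]


def predict_mean(state, height_preservation=False):
    """One constant-velocity step for x = (u, v, r, h, du, dv, dr, dh).

    With height preservation, the height derivative is set to zero before
    the prediction, so the box height of the track stays unchanged.
    """
    u, v, r, h, du, dv, dr, dh = state
    if height_preservation:
        dh = 0.0
    return [u + du, v + dv, r + dr, h + dh, du, dv, dr, dh]


state = [100.0, 50.0, 0.5, 80.0, 2.0, -1.0, 0.0, -4.0]
print(predict_mean(state))                            # height shrinks from 80 to 76
print(predict_mean(state, height_preservation=True))  # height stays at 80
```

A detection with confidence 0.75, for instance, reduces every entry of $R$ to a quarter of its value, so the update trusts the measurement more than it would for a low-confidence detection.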


Besides target motion, modelling camera motion is also important. For camera 
motion compensation (CMC), again two different methods from literature are 
Table 2.1: Motion Model Results.


KF Type | NSA | CMC | HOTA  || KF Type  | NSA | CMC | HP | HOTA
SORT    | ✗   | ✗   | 67.61 || DeepSORT | ✗   | ✗   | ✗  | 67.40
SORT    | ✓   | ✗   | 67.67 || DeepSORT | ✓   | ✗   | ✗  | 61.83
SORT    | ✗   | ECC | 67.77 || DeepSORT | ✗   | ECC | ✗  | 68.03
SORT    | ✗   | ORB | 68.36 || DeepSORT | ✗   | ORB | ✗  | 68.13
SORT    | ✓   | ECC | 68.03 || DeepSORT | ✓   | ORB | ✗  | 68.62
SORT    | ✓   | ORB | 68.35 || DeepSORT | ✓   | ORB | ✓  | 68.67

investigated: the Enhanced Correlation Coefficient (ECC) Maximization
and a model that is based on the ORB feature detector and the
RANSAC algorithm. The ORB method is a sparse image registration
technique in which foreground objects like moving persons can be neglected, in
contrast to the global ECC method. A similar approach is found in [1].


To compare the different motion models, several experiments are run on the
validation split (Val) of MOT17, which is created by dividing the train sequences
into two halves and using the second ones [35] [46] [48]. As detection model, a
publicly available YOLOX model is utilized, which has been
trained on a combined dataset consisting of CrowdHuman [27], CityPersons,
ETH [11], and the first half of the MOT17 train split. Note that this YOLOX model
can be regarded as the current standard in MPT on the MOT datasets, since
many state-of-the-art methods are using it [46]. If not
otherwise stated, the parameters of the tracker are set to $s_{init} = 0.7$, $s_{track} = 0.6$,
$d_{max} = 0.8$, $i_{max} = 30$ and the resolution of the input images is 1440 × 1080
pixels. To measure the overall tracking accuracy, HOTA is evaluated.


The results with different KF types, KF adaptations and CMC models are
summarized in Table 2.1. Without any extensions, the SORT KF performs
slightly better than the DeepSORT KF. However, the results of the DeepSORT
KF can be largely improved with the NSA adaptation, while NSA in combination
with SORT does not enhance the results in all configurations. This is not
surprising, as NSA is developed as an extension for the DeepSORT KF and the
measurement noise covariance matrices R differ among the KF types. As
expected, ORB outperforms ECC in all experiments. W.r.t. the baselines, ORB
improves the overall tracking performance by 0.75 HOTA and 0.73 HOTA for
SORT KF and DeepSORT KF, respectively. Additionally adding the height
preservation (HP) in the DeepSORT KF variant, a HOTA of 68.67 is achieved,
which is a gain of 1.27 HOTA in comparison to the DeepSORT KF baseline.
Therefore, the DeepSORT KF with NSA and HP extensions is used in all
subsequent experiments, together with the CMC model based on ORB features.


3 Distance Measures 


As mentioned previously, the distance measure is the core of each TBD algorithm.
In the baseline experiments of the last section, the IoU has been leveraged, which
is the most used motion-based distance metric in MPT. In this section, further
distance measures for the association are explored. First, motion-based matching
is analyzed in Section 3.1. Then, appearance-based matching is studied in
Section 3.2. Both types of information are combined in Section 3.3, before
further techniques like incorporating the detection confidence and applying
gating mechanisms are treated in Sections 3.4 and 3.5, respectively.


3.1 Motion-based Matching 


The authors of SimpleTrack experiment with the Generalized IoU (GIoU)
[24] as similarity measure in combination with appearance information, which
enhances the performance of their tracker. This raises the question whether
other IoU-related measures can also improve the matching accuracy. Therefore,
different adaptations of the original IoU are investigated in the following. Given
two boxes A = (x_A, y_A, w_A, h_A) and B = (x_B, y_B, w_B, h_B), the IoU is the
ratio of the intersection A ∩ B to the union A ∪ B:

\[ \mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{3.1} \]


The IoU has the drawback that non-overlapping boxes always yield an IoU of
0, independent of how far away the boxes are from each other. To solve this
issue, the GIoU is proposed as

\[ \mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|} \tag{3.2} \]

where C denotes the smallest enclosing box of A and B. While the spatial distance
of the boxes A and B has influence on the box C, it is not modelled explicitly.
In contrast, the euclidean distance
d_L2(A, B) = sqrt((x_A − x_B)^2 + (y_A − y_B)^2)
is directly used in the Distance IoU (DIoU) [47]:

\[ \mathrm{DIoU} = \mathrm{IoU} - \frac{d_{L2}^{2}(A, B)}{c^{2}} \tag{3.3} \]

Here, c denotes the diagonal of the smallest enclosing box C. The same paper
further introduces the Complete IoU (CIoU) [47], which not only explicitly
models spatial distance but also aspect ratio consistency:

\[ \mathrm{CIoU} = \mathrm{DIoU} - \alpha v \tag{3.4} \]

\[ v = \frac{4}{\pi^{2}} \left( \arctan\frac{w_A}{h_A} - \arctan\frac{w_B}{h_B} \right)^{2} \tag{3.5} \]

\[ \alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{3.6} \]

Note that the IoU and its variants are similarity measures with a maximum
similarity of 1. Thus, a distance measure can be created by subtracting the value
from 1 as in Equation 2.1.
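The four similarity measures can be sketched in a few lines. The following is a minimal illustration, assuming that a box is given as (x, y, w, h) with (x, y) being the box center and that widths and heights are positive; the function names are hypothetical and not taken from any tracker implementation.

```python
import math


def _corners(box):
    # box = (x, y, w, h); (x, y) is assumed to be the box center
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def iou_family(a, b):
    """Return IoU, GIoU, DIoU and CIoU of two boxes as a dict."""
    ax1, ay1, ax2, ay2 = _corners(a)
    bx1, by1, bx2, by2 = _corners(b)

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    iou = inter / union

    # smallest enclosing box C and its squared diagonal c^2
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c_area = cw * ch
    giou = iou - (c_area - union) / c_area

    center_dist_sq = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    diou = iou - center_dist_sq / (cw ** 2 + ch ** 2)

    # aspect ratio term v and its weight alpha
    v = 4 / math.pi ** 2 * (math.atan(a[2] / a[3]) - math.atan(b[2] / b[3])) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    ciou = diou - alpha * v

    return {"iou": iou, "giou": giou, "diou": diou, "ciou": ciou}
```

For two boxes with identical aspect ratio, v is zero and CIoU equals DIoU. A distance for the association is obtained as 1 minus the respective value.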


If a Kalman filter is used as motion model, it is possible to integrate the
uncertainty of the motion estimation into the distance measure. For this,
DeepSORT and StrongSORT calculate the squared Mahalanobis
distance between a detection D and a track T, given the state formulation of the
detection box d = (u, v, r, h)^T and the projection of the track state (mean and
covariance) into measurement space (y, S):

\[ d_{\mathrm{mahal}} = (d - y)^{T} S^{-1} (d - y) \tag{3.7} \]
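A minimal sketch of Equation 3.7 in plain Python is given below. Instead of inverting S explicitly, the linear system S z = (d − y) is solved by Gaussian elimination; the helper names are hypothetical, and S is assumed to be a symmetric positive definite matrix given as a list of rows.

```python
def solve(S, b):
    """Solve S x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(S)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def squared_mahalanobis(d, y, S):
    """(d - y)^T S^{-1} (d - y) for a detection d and projected track (y, S)."""
    diff = [di - yi for di, yi in zip(d, y)]
    z = solve(S, diff)  # z = S^{-1} (d - y)
    return sum(di * zi for di, zi in zip(diff, z))
```

A large projected covariance S shrinks the distance, which is why the measure becomes a rough estimate when the state uncertainty is high.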


Whereas DeepSORT uses only the Mahalanobis distance for gating, i.e.,
preventing unlikely assignments by enforcing a maximum distance, here, d_mahal
is directly used as matching distance. Additionally, the euclidean distance d_L2
between detection and track center is considered.


Table 3.1: Motion-based Matching Results.

d       IoU     GIoU    DIoU    CIoU    L2      Mahal
HOTA    68.67   68.47   68.74   68.74   64.70   62.94


To compare the performance of the aforementioned motion-based distance
measures d in the association, several experiments are conducted, tuning the
maximum distance threshold d_max for each metric separately. The highest
achieved HOTA values are reported in Table 3.1. One can see that the IoU-based
distance measures work much better than taking the L2 distance or the
Mahalanobis distance. While the L2 distance does not consider the important
information of box dimensions, the Mahalanobis distance is only a rough
estimation of the object location if the state uncertainty is high [40]. In the
experimental setup, DIoU and CIoU achieve the highest HOTA value of 68.74,
closely followed by IoU and GIoU. Note that DIoU and CIoU yield exactly the
same tracking results. Since the aspect ratio of targets does not vary significantly
in MPT, v in Equation 3.4 becomes a very small value, thus CIoU ≈ DIoU
holds. For this reason, the DIoU is used in the rest of this study.


3.2 Appearance-based Matching 


Similar to adopting an off-the-shelf detector, many MPT approaches take over a
model from the re-identification community for extracting appearance features
of targets [1][10][19][34][40][41]. Such a network takes a small image patch of a
detected person as input and computes a high-dimensional feature vector that
represents the appearance of the person. In appearance-based matching, several
design choices have to be made when comparing the features of detections and
tracks. Which distance measure should be used? How many time steps shall
be considered to describe the appearance of a track? What is the best way to
combine features from different time steps? In this section, a large number
of experiments are conducted to answer these questions empirically. Given
two m-dimensional feature vectors f_D and f_T from a detection and a track,
respectively, one can calculate either the cosine distance d_cos or the euclidean




A Detailed Study of the Association Task in Tracking-by-Detection-based Multi-Person Tracking 


distance d_L2 to measure their appearance similarity:

\[ d_{\cos} = 1 - \frac{f_D \cdot f_T}{\lVert f_D \rVert \, \lVert f_T \rVert} \tag{3.8} \]

\[ d_{L2} = \sqrt{(f_{D,1} - f_{T,1})^{2} + (f_{D,2} - f_{T,2})^{2} + \dots + (f_{D,m} - f_{T,m})^{2}} \tag{3.9} \]

Note that d_cos ∈ [0, 2] and d_L2 ∈ [0, ∞) hold and ||·|| represents the Euclidean
norm. Studying the source code of a few appearance-based MPT methods, it is
observed that some methods apply a mask to the cosine distance matrix before
solving the assignment problem with the Hungarian method. More precisely, all
entries above the maximum distance threshold d_max are set to d_max + ε with ε
being a very small value, e.g., 1e-5. This causes unlikely assignments with a
distance above the matching threshold d_max to have the same contribution to
the overall cost that is minimized by the Hungarian algorithm.
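The cosine distance and the masking step can be sketched as follows. This is a minimal illustration with hypothetical function names; the default ε = 1e-5 is an assumption for the example, and the distance matrix is a plain list of rows (tracks) over columns (detections).

```python
import math


def cosine_distance(f_d, f_t):
    """d_cos = 1 - (f_D . f_T) / (||f_D|| ||f_T||), a value in [0, 2]."""
    dot = sum(p * q for p, q in zip(f_d, f_t))
    norm = math.sqrt(sum(p * p for p in f_d)) * math.sqrt(sum(q * q for q in f_t))
    return 1.0 - dot / norm


def mask_costs(cost, d_max, eps=1e-5):
    """Clamp every entry above d_max to d_max + eps before the assignment."""
    return [[c if c <= d_max else d_max + eps for c in row] for row in cost]
```

After masking, all infeasible pairs carry the same cost, so the solver no longer distinguishes between a slightly and a grossly implausible assignment.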


While the detection feature f_D is simply the output of the re-identification
model, there are multiple possibilities to build the track feature f_T. In the
simplest case, the feature f_D^{t-1} from the last assigned detection D^{t-1} of the
track T^t = [D^{t_init}, ..., D^{t-2}, D^{t-1}] is used as track feature:
f_T = f_D^{t-1}. To benefit from temporal information, DeepSORT builds a feature
bank F_T = [f_D^{t-N}, ..., f_D^{t-2}, f_D^{t-1}] with the features of the past N time
steps. The distance to a current detection feature f_D^t is calculated for each
feature of the bank. The appearance distance d(D, T) between a detection D and a
track T is then chosen to be the minimum of all distances derived from the
feature bank:

\[ d_{\min}(D, T) = d_{\min}(D, F_T) = \min_{i \in [1, \dots, N]} d\left(f_D^{t}, f_D^{t-i}\right) \tag{3.10} \]

If the target is clearly visible both in one of the last N frames and the current
frame, the extracted features are of high quality and taking the minimum
appearance distance is a good choice. However, this is not always the case in
MPT, especially when facing severe occlusions. In such situations, the mean
distance might be a better choice:

\[ d_{\mathrm{mean}}(D, T) = d_{\mathrm{mean}}(D, F_T) = \frac{1}{N} \sum_{i=1}^{N} d\left(f_D^{t}, f_D^{t-i}\right) \tag{3.11} \]

Moreover, it is possible to average the two measures, which results in a third
strategy to calculate the appearance distance between a detection and a track:

\[ d_{\mathrm{mean+min}}(D, T) = \frac{1}{2} \left( d_{\mathrm{mean}}(D, F_T) + d_{\min}(D, F_T) \right) \tag{3.12} \]

The last investigated strategy for computing the appearance distance is adopted
from [38]. Instead of using a feature bank, the track feature f_T is updated in an
exponential moving average (EMA) fashion with the newly assigned detection
feature f_D^t and a weighting factor α in each time step:

\[ f_T^{t} = \alpha f_T^{t-1} + (1 - \alpha) f_D^{t} \tag{3.13} \]
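The feature bank strategies and the EMA update can be sketched as follows. This is a minimal illustration with hypothetical function names; the L2 distance serves only as a placeholder for the chosen distance d, and features are plain lists of floats.

```python
import math


def l2(f, g):
    """Euclidean distance between two feature vectors (placeholder for d)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(f, g)))


def bank_distance(f_det, bank, strategy="mean+min", dist=l2):
    """Distance between a detection feature and a bank of past track features."""
    ds = [dist(f_det, f) for f in bank]
    d_min = min(ds)
    d_mean = sum(ds) / len(ds)
    if strategy == "min":
        return d_min
    if strategy == "mean":
        return d_mean
    if strategy == "mean+min":
        return 0.5 * (d_mean + d_min)
    raise ValueError("unknown strategy: " + strategy)


def ema_update(f_track, f_det, alpha):
    """EMA update of the track feature with the newly assigned detection feature."""
    return [alpha * t + (1 - alpha) * d for t, d in zip(f_track, f_det)]
```

With the EMA variant, only a single feature per track has to be stored, whereas the bank keeps up to N features and requires N distance computations per detection.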


The re-identification model from [1] is leveraged for feature extraction in the
experimental evaluation. It is a BoT (SBS) model with ResNeSt50 as
backbone, trained on the first half of the MOT17 train split. The performance of
the aforementioned appearance-based distance measures and strategies is again
compared on the MOT17 Val split, whereby the maximum distance threshold
d_max is optimized for each configuration separately. For experiments using the
EMA technique, the corresponding parameter α is also tuned.


The resulting HOTA values are reported in Table 3.2. One can see that masking
the distance matrix is beneficial for cosine distance but not for euclidean (L2)
distance. With masking, cosine distance (66.47) outperforms the best L2
configuration (66.11) by 0.36 HOTA. Taking N = 10 past time steps in a feature
bank into account, the results improve significantly by 1.40 to 2.25 points,
depending on the strategy of the distance calculation. This shows the importance
of temporal information in appearance-based matching. The best results are
achieved by averaging the mean and minimum distance of the features
(mean+min). Increasing the number of features yields improvements up to
N = 10, while HOTA values decrease again using 20 or even 100 features. The
EMA strategy achieves competitive results, but its HOTA is 0.25 points worse
than that of the best configuration (the mean+min strategy with N = 10 past
features and masked cosine distance), which achieves 68.72 HOTA. Note that the
overall performance of the appearance-based matching is on par with the
motion-based matching from the previous section (Table 3.1). However, on
individual sequences of the dataset, differences in HOTA of up to 4 points are
observed. This motivates the combination of motion- and appearance-based
matching, which is investigated in the next section.


Table 3.2: Appearance-based Matching Results.

d        Masking   N     Strategy   EMA   HOTA
Cosine   ✗         1     ✗          ✗     66.19
Cosine   ✓         1     ✗          ✗     66.47
L2       ✗         1     ✗          ✗     66.11
L2       ✓         1     ✗          ✗     66.03
Cosine   ✓         10    min        ✗     67.87
Cosine   ✓         10    mean       ✗     68.20
Cosine   ✓         10    mean+min   ✗     68.72
Cosine   ✓         1     mean+min   ✗     66.47
Cosine   ✓         2     mean+min   ✗     67.25
Cosine   ✓         5     mean+min   ✗     68.26
Cosine   ✓         10    mean+min   ✗     68.72
Cosine   ✓         20    mean+min   ✗     68.60
Cosine   ✓         100   mean+min   ✗     68.05
Cosine   ✓         1     ✗          ✓     68.47


3.3 Combined Matching 


Motion- and appearance-based distance measures provide different types of
information. Thus, combining both kinds to an advanced distance measure is a
promising approach which is also followed in other works [1][10][18]. Given
two distance measures d_1, d_2 and corresponding weights w_1, w_2, a combined
distance d_comb can simply be built by a weighted sum:

\[ d_{\mathrm{comb}} = w_1 d_1 + w_2 d_2 \tag{3.14} \]
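Applied to whole distance matrices, the weighted sum of Equation 3.14 can be sketched as below. The function name and the matrix layout (list of rows, one row per track) are assumptions made for illustration.

```python
def combined_distance_matrix(d_1, d_2, w_1=1.0, w_2=1.0):
    """Element-wise weighted sum w_1 * d_1 + w_2 * d_2 of two distance matrices."""
    if len(d_1) != len(d_2) or any(len(r1) != len(r2) for r1, r2 in zip(d_1, d_2)):
        raise ValueError("distance matrices must have the same shape")
    return [
        [w_1 * a + w_2 * b for a, b in zip(row_1, row_2)]
        for row_1, row_2 in zip(d_1, d_2)
    ]
```

Because the weights change the value range of the combined distance, a maximum distance threshold tuned for one setting does not carry over to another.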


For motion information, the IoU-based distance measures d_IoU, d_GIoU and d_DIoU
are considered, while the feature cosine distance d_cos is used for appearance
information. Experiments with different configurations are conducted on MOT17
Val. Note that the maximum distance threshold d_max is adjusted when changing
distance measures or one of the weights w_1 or w_2. The resulting HOTA values
are listed in Table 3.3. The previously achieved results using either motion-
or appearance-based information are also given for reference. The best results
are very similar with HOTA = 68.74 for DIoU and HOTA = 68.72 for cosine
distance, which justifies the usage of both cues. Combining IoU distance and
cosine distance with equal contribution (w_1 = w_2 = 1), HOTA improves to
68.91. Giving more weight to the motion-based measure (w_1 = 2, w_2 = 1), the
performance decreases. However, if the appearance information is taken more
into account (w_1 = 1, w_2 > 1), HOTA can be further enhanced up to 69.22
for w_2 = 4. The same holds true for combining GIoU or DIoU distance with
appearance cosine distance. The largest HOTA value of 69.41 is obtained by
combining DIoU distance and cosine distance while setting w_1 = 1 and w_2 = 4,
i.e., giving four times the weight to the appearance information. This is a gain
of 0.69 points in HOTA compared to using only one distance measure.


Table 3.3: Combined Matching Results.

d_1      d_2      w_1   w_2   HOTA    |   d_1    d_2      w_1   w_2   HOTA
IoU      ✗        ✗     ✗     68.67   |   IoU    Cosine   1     2     69.16
GIoU     ✗        ✗     ✗     68.39   |   IoU    Cosine   1     3     69.13
DIoU     ✗        ✗     ✗     68.74   |   IoU    Cosine   1     4     69.22
Cosine   ✗        ✗     ✗     68.72   |   IoU    Cosine   1     5     69.04
IoU      Cosine   1     1     68.91   |   GIoU   Cosine   1     4     69.37
IoU      Cosine   2     1     68.62   |   DIoU   Cosine   1     4     69.41


Note that experiments have also been conducted with the Mahalanobis distance
d_mahal (Equation 3.7) in combination with the appearance cosine distance as it
is done in StrongSORT [10]. The highest achieved HOTA in the experimental
setup is 69.13. While this is also an improvement w.r.t. using only appearance
information, the performance is worse than combining DIoU distance with the
appearance cosine distance. Therefore, the combination of DIoU distance and
cosine distance is utilized in the remainder of this study.


Table 3.4: Use Detection Confidence Results.

d      d_score   HOTA    |   d      d_score   HOTA    |   d             d_score   HOTA
IoU    ✗         68.67   |   DIoU   ✗         68.74   |   DIoU+Cosine   ✗         69.41
IoU    ✓         68.78   |   DIoU   ✓         68.79   |   DIoU+Cosine   ✓         69.19


3.4 Use of Detection Confidence 


Some IoU-based MPT methods incorporate the detection confidence s into the
distance calculation by simple multiplication [1][28][46]:

\[ d_{\mathrm{IoU,score}}(D, T) = 1 - \left( \mathrm{IoU}(B_D, B_T) \cdot s \right) \tag{3.15} \]

The motivation behind it is that more confident detections should be favored
in the association. Note that this strategy can also be applied together with
other IoU-based metrics and its influence is investigated empirically. Because
the multiplication with s ∈ [s_track, 1] changes the scale of the distance measure
d, the maximum distance threshold d_max has again been tuned. The results
are depicted in Table 3.4. Integrating the detection score into the distance
matrix slightly improves HOTA by 0.11 and 0.05 points for IoU and DIoU
distance, respectively. However, in combination with the appearance cosine
distance, which yields the overall best results, using the detection score degrades
the performance. Thus, the detection score is not leveraged in the distance
calculation in the remainder of the study.
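As a small worked example of the scale change (the numbers are illustrative and not taken from the experiments), consider a detection and a track with IoU = 0.8 and a detection score s = 0.7:

\[ d_{\mathrm{IoU}} = 1 - 0.8 = 0.2, \qquad d_{\mathrm{IoU,score}} = 1 - 0.8 \cdot 0.7 = 0.44 \]

The same geometric overlap thus yields a larger distance once the score is included, so a threshold tuned for the plain IoU distance would reject more pairs if it were kept unchanged.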


3.5 Gating 


As mentioned before, DeepSORT utilizes the Mahalanobis distance to
prevent unlikely assignments, which is referred to as gating. The distance
measure is only used to prohibit assignments with a distance value above a
threshold but is not integrated into the matching distance. In this section, the
influence of such a gating mechanism on the tracking performance is analyzed.
Besides Mahalanobis distance, IoU, DIoU and appearance cosine distance are
tested as gating measures. The combination of DIoU and cosine distance from
Section 3.3 is taken as distance for matching. Tracking results with additional
gating are depicted in Table 3.5. In the experiments, only small HOTA gains
up to 0.06 points are achieved, although the gating thresholds have been tuned
carefully. For this reason and because a too small gating threshold can degrade
the tracking performance, gating is not used in the rest of this work.


Table 3.5: Gating Results.

Gating   ✗       IoU     DIoU    Cosine   Mahal
HOTA     69.41   69.45   69.47   69.41    69.42
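The gating mechanism described above can be sketched as follows. This is a minimal illustration, assuming that the matching distances and the gating distances are given as two matrices of equal shape; the function name and the use of infinity to mark prohibited pairs are assumptions.

```python
import math


def apply_gating(cost, gate, gate_max):
    """Prohibit pairs whose gating distance exceeds gate_max.

    cost: matching distance matrix (used for the assignment)
    gate: gating distance matrix (only used to prohibit pairs)
    """
    return [
        [c if g <= gate_max else math.inf for c, g in zip(cost_row, gate_row)]
        for cost_row, gate_row in zip(cost, gate)
    ]
```

The matching cost of the remaining pairs is left untouched, so the gating measure never influences the ranking of permitted assignments.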


4 Multiple Matching Stages 


It is common practice in MPT to solve the assignment problem for all tracks
and detections at once, as also done in this study so far. However, a few works
split the set of tracks or detections into subsets which are processed one after
another [40][46]. Two strategies are revisited: a matching cascade
from the famous DeepSORT tracker (Section 4.1) and the BYTE
association method, which recently led to notable improvements (Section 4.2).


4.1 DeepSORT Matching Cascade 


Given an example track T^t = [D^{t_init}, ..., D^{t-k}] at time step t, its age a is
defined as the time since the track has been observed for the last time. For this
example track, a = k holds. Note that in this definition, active tracks have an
age of 1, whereas inactive tracks have an age greater than 1. In DeepSORT [40],
tracks with an age of 1 are matched with all available detections. Then, all tracks
with an age of 2 are matched with the remaining unmatched detections and
so forth. The motivation behind this strategy is to favor tracks that have been
observed recently, since the accuracy of propagated track locations decreases
over time. However, in StrongSORT, a further development of DeepSORT,
it is found that this matching cascade harms the tracking performance when the
tracker gets stronger because the additional prior constraints limit the matching
accuracy [10]. To investigate the influence of the DeepSORT matching cascade
on the so-far best tracker of this study (Section 3.3), it is utilized in an additional
experiment. The result is shown in Table 4.1. Integrating the matching cascade
significantly decreases HOTA by 1.55 points, which confirms the results from
[10]. Consequently, this matching cascade is not used in further experiments.


Table 4.1: DeepSORT (DS) Matching Cascade Results.

DS Matching Cascade   HOTA    |   DS Matching Cascade   HOTA
✗                     69.41   |   ✓                     67.86


4.2 BYTE Association 


Usually, only high-confident detections are used in the association as
low-confident ones include many false positives that harm the tracking performance.
In contrast, an association technique named BYTE is proposed in [46], which
allows making use of low-confident detections in a second matching stage.
Detections with confidence score below s_track are not removed but compared to
unmatched tracks that have not been assigned a high-confident detection in the
first association. Since the low-confident detections are not utilized to start new
tracks but only for assignment to already tracked targets, the overall performance
can be largely increased. The authors of [46] show this by applying the BYTE
association to different trackers, which leads to consistent improvements. Among
the trackers, the varying distance measures are kept in the first matching stage.
However, in the newly introduced second matching stage, only the IoU distance
is leveraged as the authors argue that most tracks in this stage suffer from
occlusion or motion blur, where appearance features are not reliable [46].
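The two-stage idea can be sketched as follows. This is a simplified illustration and not the implementation of [46]: a greedy matcher stands in for the Hungarian method, all function and parameter names are hypothetical, and track management (initialization, deletion) is left out.

```python
def greedy_match(cost, d_max):
    """Greedy stand-in for the Hungarian method: cheapest feasible pairs first."""
    pairs = sorted(
        (c, i, j)
        for i, row in enumerate(cost)
        for j, c in enumerate(row)
        if c <= d_max
    )
    used_rows, used_cols, matches = set(), set(), []
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            matches.append((i, j))
    return matches


def two_stage_associate(tracks, dets, scores, dist_1, dist_2,
                        s_track, d_max_1, d_max_2):
    """Return (stage-1 matches, stage-2 matches) as (track index, det index)."""
    high = [j for j, s in enumerate(scores) if s >= s_track]
    low = [j for j, s in enumerate(scores) if s < s_track]

    # Stage 1: all tracks vs. high-confident detections.
    cost_1 = [[dist_1(t, dets[j]) for j in high] for t in tracks]
    m_1 = [(i, high[j]) for i, j in greedy_match(cost_1, d_max_1)]

    # Stage 2: still unmatched tracks vs. low-confident detections.
    matched = {i for i, _ in m_1}
    rest = [i for i in range(len(tracks)) if i not in matched]
    cost_2 = [[dist_2(tracks[i], dets[j]) for j in low] for i in rest]
    m_2 = [(rest[i], low[j]) for i, j in greedy_match(cost_2, d_max_2)]
    return m_1, m_2
```

Low-confident detections that remain unmatched after the second stage are simply discarded in this sketch, since they are not used to start new tracks.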


Since the tracking pipeline of this study differs quite a lot from other approaches
with its improved motion modelling and the combined distance measure from
Section 3.3, experiments are also conducted with the appearance-based
cosine distance next to other distance measures in the second association stage.
Although it is not mentioned in the paper [46], the publicly available source
code reveals that only active tracks are considered in the second matching stage.
In this study, it is also tested whether the inclusion of inactive tracks in this stage
can be beneficial. Resulting HOTA values of the conducted experiments related
to the second matching stage can be found in Table 4.2.


Table 4.2: Second Matching Results. The best result using only one matching stage is achieved
with a combination of DIoU and cosine distance: HOTA = 69.41 (Table 3.3).

Use Inactive   Distance      HOTA    |   Use Inactive   Distance      HOTA
✗              IoU           69.66   |   ✓              IoU           70.29
✗              DIoU          69.70   |   ✓              DIoU          70.22
✗              Cosine        69.68   |   ✓              Cosine        70.22
✗              DIoU+Cosine   69.73   |   ✓              DIoU+Cosine   70.14


In contrast to [46], appearance-based distances like the cosine distance and the
combination with DIoU also achieve good results. Compared to the baseline,
where only one matching stage is used (HOTA = 69.41), gains up to 0.32 HOTA
are obtained. Note that the applied distance threshold of the second stage d_max,2
influences the performance, so it is tuned carefully for each configuration.


When inactive tracks are used additionally, IoU-based matching results in 70.29
HOTA, which is a huge improvement compared to using only active tracks in
the second matching stage. It is observed that the optimized distance threshold
d_max,2 is much lower than in the implementation of [46] (0.19 vs. 0.5). Setting
such a low threshold ensures that only inactive tracks with accurately predicted
locations can be matched. With the usage of inactive tracks, no prior constraints
that could limit the matching accuracy, such as those of the DeepSORT matching
cascade (see Section 4.1), are applied. Since the IoU-based matching in
the second stage yields an improvement of 0.88 HOTA in comparison to the
one-stage baseline, it is leveraged in all further experiments.


Table 5.1: Parameter Tuning Results.

                i_max   N    Strategy   s_init   s_track   d_max,1   d_max,2   HOTA
Before Tuning   30      10   mean+min   0.7      0.6       3.18      0.19      70.29
After Tuning    28      16   mean       0.7      0.6       3.13      0.19      70.77


5 Parameter Tuning and Sensitivity 


Before evaluating different motion models to develop a baseline tracker for this
study, some parameters had to be set initially: the number of frames an inactive
track is kept (i_max), confidence thresholds for detections to be considered in the
association and to start new tracks (s_track and s_init) and the maximum distance
threshold to prevent unlikely assignments (d_max,1). Extending the tracking
framework with additional components, further parameters are introduced.
Integrating appearance features (Section 3.2), the number of past time steps in
the feature bank (N) and the strategy how to calculate the cosine distance (min,
mean, mean+min) have to be chosen. With the utilization of a second matching
stage (Section 4.2), another maximum distance threshold has to be set (d_max,2).
Since the number of parameters has increased during this study, some might
not be set optimally anymore. For this reason, an extensive grid search has been
performed to find the best parameter configuration of the tracker. The results
are summarized in Table 5.1, whereby parameters that have changed are bold.
Optimizing the set of parameters yields a notable gain of 0.48 HOTA.


To get a better understanding of the importance and the sensitivity of the tracking
parameters, hundreds of experiments have been conducted in which each parameter
has been varied within a reasonable interval around the best value (specified by
grid search), while all the other parameters were fixed at their optimum. The
resulting HOTA curves are shown in Figure 5.1.
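The one-parameter-at-a-time procedure can be sketched as follows. The function `evaluate` is a hypothetical callback standing in for a full tracker run followed by the HOTA computation, and the toy score below exists only to make the example executable; it does not reproduce any measured value.

```python
def sensitivity_sweep(evaluate, best, grids):
    """Vary one parameter at a time around `best` and record the score."""
    curves = {}
    for name, values in grids.items():
        curve = []
        for value in values:
            params = dict(best)   # all other parameters stay at their optimum
            params[name] = value
            curve.append((value, evaluate(params)))
        curves[name] = curve
    return curves


def toy_score(params):
    # Hypothetical stand-in for "run the tracker and compute HOTA".
    return -abs(params["i_max"] - 28) - 2 * abs(params["N"] - 16)


best = {"i_max": 28, "N": 16}
curves = sensitivity_sweep(toy_score, best, {"i_max": [20, 28, 36], "N": [8, 16, 24]})
```

Each entry of `curves` then holds the (value, score) pairs for one parameter, which corresponds to one curve of the sensitivity plot.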


The confidence threshold s_init of a detection to initialize a new track obviously
has a large influence on the tracking performance. With a too low threshold, many
false positives are introduced, whereas with a too large threshold, many targets
are missed. The track threshold s_track decides whether a detection is considered
in the first association stage or the second. Priority is given to detections with
confidence above s_track and in the second stage, a stricter maximum distance is
enforced for the lower-confident detections. In the experiments, s_track = 0.6
achieved the best results. This value is 0.1 smaller than s_init, which equals the
relation in [46].


Figure 5.1: Sensitivity of Tracking Parameters.


Another important parameter is i_max. The higher the value, the longer the
occlusions that can be bridged. If this so-called inactive patience, however, is
too high, wrong assignments to inactive tracks can occur, since the location
accuracy decreases over time. For the number of appearance features in the
feature bank, the empirically found best value is N = 16. If only a few features
are considered, the full potential of the temporal information is not leveraged,
whereas features from too far in the past might not be representative anymore
due to changes in appearance.


The best values for the matching thresholds d_max,1 and d_max,2 are 3.13 and 0.19,
respectively, on MOT17 Val. Too small values prevent correct assignments while
too large values allow wrong assignments. The fluctuations in the corresponding
HOTA curves are caused by the small depicted HOTA ranges and, as for all
parameters, are additionally attributable to the finite dataset size.


6 Post-processing


The so-far developed tracking framework works fully online, which means that
the tracking results are final after processing each frame of the input video. Some
applications without real-time requirements allow refining the tracking results
with post-processing techniques to improve the performance. Besides simple
linear interpolation of fragmented tracks, two more sophisticated post-processing
methods introduced in StrongSORT are investigated: the Appearance Free
Link (AFLink) model and Gaussian Smoothed Interpolation (GSI).


AFLink is a small convolutional neural network that takes the center positions
and corresponding frames of two tracks as input and computes a connectivity
score solely based on spatio-temporal information. If this connectivity score
is higher than a threshold and some spatio-temporal constraints are fulfilled,
the two tracks are linked, hypothesizing that they belong to the same target.
Implementation details can be found in the StrongSORT paper [10].


Since the maximum gap in a fragmented track is i_max = 28 (see Table 5.1),
which corresponds to roughly one second on MOT17, many of those gaps can
be successfully filled with linear interpolation (LI). However, in some cases
the linear approximation is not accurate enough. Therefore, GSI employs
Gaussian process regression to model non-linear motion of targets. Another
advantage compared to the linear interpolation is that the noisy trajectories are
smoothed. For details of the GSI algorithm, the reader is referred to [10].


Table 6.1: Post Processing Results.

AFLink   Interpolation   HOTA   |   AFLink   Interpolation   HOTA
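Linear interpolation of fragmented tracks can be sketched as follows. This minimal illustration assumes that a track is stored as a mapping from frame index to a box tuple; the function name and the data layout are hypothetical. GSI would replace the straight-line estimate by a Gaussian process regression, which is not shown here.

```python
def interpolate_track(track, max_gap):
    """Fill gaps of at most max_gap missing frames by linear interpolation.

    track: dict mapping frame index -> box tuple, e.g. (x, y, w, h)
    """
    frames = sorted(track)
    filled = dict(track)
    for f0, f1 in zip(frames, frames[1:]):
        gap = f1 - f0 - 1  # number of missing frames between two observations
        if 0 < gap <= max_gap:
            for f in range(f0 + 1, f1):
                t = (f - f0) / (f1 - f0)
                filled[f] = tuple(
                    a + t * (b - a) for a, b in zip(track[f0], track[f1])
                )
    return filled
```

Gaps longer than `max_gap` are left untouched, which mirrors the fact that a track is only kept inactive for a limited number of frames.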


Table 6.1 depicts the post-processing results after application of AFLink as well
as linear and Gaussian smoothed interpolation. AFLink slightly improves HOTA
by 0.07 points. Since the model does not integrate appearance information,
strict spatio-temporal constraints have to be enforced to prevent wrong
connections. For potentially larger improvements, more sophisticated approaches like
ReMOT could be applied, which is left for future work. Based on appearance
features enhanced by self-supervised learning, tracks are not only merged in
[22], but erroneous tracks consisting of different targets are additionally cut
apart. Looking at the results of the two interpolation techniques, it is observed
that both significantly improve the overall performance with gains of 1.68 and
1.97 points in HOTA for LI and GSI, respectively. As expected, the non-linear
GSI outperforms the simple linear interpolation.


7 Ablation Study 


In this work, several components related to the association task in MPT have been 
investigated and a strong tracking framework based on the TBD paradigm has 
been developed. Starting from a simple baseline with standard Kalman filter (KF) 
for track propagation and IoU distance as association metric, extensions of the 
KF and a camera motion compensation (CMC) module were introduced. Then, 
motion-based matching was combined with appearance-based matching leading 
to a sophisticated distance measure. Afterwards, low-confident detections were 
integrated into the association within a second matching stage. Finally, parameter 
tuning and post-processing were performed. All these steps lead to consistent
improvements of the overall tracking performance measured in HOTA that are
summarized in Table 7.1. Besides the offline post-processing, the largest gains
in the online tracker come from the second matching stage (+0.88 HOTA), the
combined distance measure (+0.74 HOTA), and the CMC model (+0.73 HOTA).
All components together boost HOTA significantly from 67.40 to 72.81.


Table 7.1: Ablation Study. Abbreviations: CMC = camera motion compensation, NSA+HP = Noise
Scale Adaptive Kalman filter + height preservation, DIoU+Cosine = Distance IoU + cosine distance,
PT = parameter tuning, PP = post-processing (AFLink + Gaussian smoothed interpolation).

CMC   NSA+HP   DIoU+Cosine   2nd Matching   PT   PP   HOTA
✗     ✗        ✗             ✗              ✗    ✗    67.40 (+0.00)
✓     ✗        ✗             ✗              ✗    ✗    68.13 (+0.73)
✓     ✓        ✗             ✗              ✗    ✗    68.67 (+0.54)
✓     ✓        ✓             ✗              ✗    ✗    69.41 (+0.74)
✓     ✓        ✓             ✓              ✗    ✗    70.29 (+0.88)
✓     ✓        ✓             ✓              ✓    ✗    70.77 (+0.48)
✓     ✓        ✓             ✓              ✓    ✓    72.81 (+2.04)
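The overall gain can be checked by summing the individual contributions. The online components add up to the HOTA value before post-processing:

\[ 67.40 + 0.73 + 0.54 + 0.74 + 0.88 + 0.48 = 70.77 \]

The remaining difference to the final result is the post-processing gain, which matches the sum of the AFLink and GSI gains reported in Section 6:

\[ 72.81 - 70.77 = 2.04 = 0.07 + 1.97 \]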


8 Comparison with the State-of-the-Art 


The final tracker of this study is named StrongTBD because of the large
improvements w.r.t. the TBD baseline from Section 2. StrongTBD is compared
to the state-of-the-art on MOT17 and MOT20 [6] test splits in this section.
Before delving into the results, it should be noted that annotations of the
test splits are not publicly available and evaluation is done by submitting the
tracking results to the official server (motchallenge.net). Besides HOTA, other
performance measures such as MOTA [3] and IDF1 are also computed. To
prevent parameter tuning on the test data, one is restricted to four submissions.
However, the tracking performance is highly dependent on the setting of some
parameters, especially on the detection thresholds s_init and s_track (see Section 5).
For example, when changing s_init and s_track from 0.7 to 0.4 and 0.6 to 0.3,
respectively, MOTA increases by approximately 10 points on the MOT20-08
sequence in the



Daniel Stadler 


Table 8.1: State-of-the-Art Methods on MOT17.

Method              MOTA   IDF1   HOTA   FP      FN      IDSW
MAATrack [30]       79.4   75.9   62.0   37320   77661   1452
RTU++ [36]          79.5   79.1   63.9   29508   84618   1302
StrongSORT [10]     79.6   79.5   64.4   27876   86205   1194
SAT [37]            80.0   79.8   64.4   25125   86505   1356
ByteTrack [46]      80.3   77.3   63.1   25491   83721   2196
QuoVadis [7]        80.3   77.7   63.1   25491   83721   2103
FOR_Tracking [23]   80.4   71.1   63.6   28674   79452   2298
BoT-SORT [1]        80.5   80.2   65.0   22521   86037   1212
ByteTrackV2 [28]    80.6   78.9   63.6   35208   73224   1239
StrongTBD           81.6   80.8   65.6   24171   78759   954


Table 8.2: Values of s_init on MOT17 and MOT20 test sets.

MOT17    01    03     06     07    08    12    14
s_init   0.8   0.75   0.75   0.7   0.7   0.8   0.65

MOT20    04    06    07    08
s_init   0.7   0.4   0.7   0.4

submissions of StrongTBD. This and the fact that some works do not report
their applied thresholds make a fair comparison among methods difficult. The
trend of using various thresholds for different sequences of the datasets
further complicates the comparison.


Nevertheless, Table 8.1 lists the 10 best performing trackers on MOT17 in
ascending order of MOTA. StrongTBD achieves the highest values in MOTA,
IDF1, and HOTA. Furthermore, it has the lowest number of identity switches
(IDSW). Despite the aforementioned comparability issues, the results show
that the developed tracker can compete with the state of the art. To make
these results reproducible, the $s_{\mathrm{init}}$ values of the submission for the sequences
of MOT17 are reported in Table 8.2. Note that for the tracking thresholds,
$s_{\mathrm{track}} = s_{\mathrm{init}} - 0.1$ holds, just as in [1][28][46].
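The per-sequence thresholds of Table 8.2, together with the rule that the tracking threshold lies 0.1 below the initialization threshold, can be written down as a small lookup. This is only an illustrative sketch: the dictionary layout and the helper name `thresholds_for` are assumptions made here and are not part of the tracker.

```python
from typing import Dict, Tuple

# Initialization thresholds s_init per test sequence, as reported in Table 8.2.
S_INIT: Dict[str, float] = {
    "MOT17-01": 0.8, "MOT17-03": 0.75, "MOT17-06": 0.75, "MOT17-07": 0.7,
    "MOT17-08": 0.7, "MOT17-12": 0.8, "MOT17-14": 0.65,
    "MOT20-04": 0.7, "MOT20-06": 0.4, "MOT20-07": 0.7, "MOT20-08": 0.4,
}


def thresholds_for(sequence: str) -> Tuple[float, float]:
    """Return (s_init, s_track) for a sequence, using s_track = s_init - 0.1."""
    s_init = S_INIT[sequence]
    # Rounding avoids floating-point artifacts in the subtraction.
    s_track = round(s_init - 0.1, 2)
    return s_init, s_track


for name in sorted(S_INIT):
    print(name, thresholds_for(name))
```

Keeping the thresholds in one place like this makes it explicit which values were used for which sequence, which is the information needed to reproduce a submission.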
Table 8.3: State-of-the-Art Methods on MOT20.


Method             MOTA  IDF1  HOTA  FP     FN      IDSW
SAT [37]           75.0  76.6  62.6  15549  113136  816
OC-SORT [5]        75.7  76.3  62.4  19067  105894  942
RTU++ [36]         76.5  76.8  62.8  19247  101290  971
FOR_Tracking [23]  76.8  76.4  61.4  27112  91254   1443
ByteTrackV2 [28]   77.3  75.6  61.4  22867  93409   1082
ReMOT [42]         77.4  73.1  61.2  28351  86659   1789
ByteTrack [46]     77.8  75.2  61.3  26249  87594   1223
QuoVadis [7]       77.8  75.7  61.5  26249  87594   1187
BoT-SORT [1]       77.8  77.5  63.3  24638  88863   1313
StrongTBD          78.0  77.0  63.6  25473  87330   1101


Table 8.2 also shows the values of $s_{\mathrm{init}}$ of the final submission on the MOT20
dataset. The results of the 10 best performing trackers on this benchmark are
given in Table 8.3. StrongTBD obtains the highest MOTA and HOTA as well as
the second highest IDF1, which confirms the competitiveness of the developed
tracking framework. Note that the parameter configuration of StrongTBD has
been adapted on the MOT20 dataset in order to be more comparable to the
second best entry BoT-SORT [1]. More precisely, the input resolution of the
MOT20-04 and MOT20-07 sequences is set to 1600×896 pixels, while a
resolution of 1920×736 pixels is used for MOT20-06 and MOT20-08. In addition,
an IoU distance threshold of 0.7 is integrated, which helps to prevent IDSW
in crowded scenes. Furthermore, the same initialization strategy as in 
is followed, in that new tracks are tentative until they get confirmed with an 
assigned detection in the subsequent frame. As already discussed in Section 2,
such a strategy is beneficial if the threshold $s_{\mathrm{init}}$ is quite low, which is the case
for MOT20-06 and MOT20-08 (see Table 8.2). The target density on MOT20
with 127 persons per image is much higher than on MOT17 with only 21.1 
persons per image B2]. As StrongTBD has been developed on MOT17 Val, 
some design choices are not optimal for very crowded scenes as in MOT20. In 
the future, more focus should be put on tracking in such challenging scenarios. 
9 Conclusion


In this study, all components of the association task in MPT have been analyzed 
in detail. Two of the most important findings are that the combination of 
motion- and appearance-based distance measures outperforms the sole usage of 
one information type and that leveraging low-confident detections in a second 
association stage yields significant improvements. The influence of various 
tracking components from motion models to post-processing techniques has 
been investigated as well as the sensitivity of the results to the setting of 
tracking parameters. The empirical results were used to develop a sophisticated 
tracking-by-detection method that achieves state-of-the-art performance on the 
two challenging MPT benchmarks MOT17 and MOT20. Further potential lies
in enhancing the association accuracy in very crowded scenes as in the MOT20
dataset, which should be investigated more thoroughly in the future.
Abstract


Fine-grained vehicle classification is an important task particularly for security 
applications like searching for cars of suspects who abuse stolen license plates. 
However, data privacy and the large number of existing car models render 
it highly difficult to create a large up-to-date dataset for fine-grained vehicle 
classification with surveillance images. While a large number of vehicle
images is available on the web due to car selling sites, their perspective
differs vastly from that of surveillance images. Domain adaptation is the
field of research that trains classification models on images from one domain
with the goal of running accurate inference on images
of a different domain. Since the widely considered unsupervised and semi- 
supervised domain adaptation settings are unrealistic for fine-grained vehicle 
classification, we establish a baseline for cross-domain fine-grained vehicle 
classification in a supervised partially zero-shot setting. Our results indicate 
that existing domain adaptation methods like domain adversarial training and 
triplet loss are still advantageous for this setting and we show the benefit of 
distance-based classification for this task. 
1 Introduction


Fine-grained classification tasks like vehicle make and model recognition
rely on large datasets for training. These are needed because the learned model
has to properly capture the small inter-class variance relative to the large
intra-class variance. While a large number of images of different cars is
available on the web, e.g. from car selling sites, fine-grained classification
is often applied in different domains. For example, vehicle make and model
recognition is useful for security applications like manhunts when applied to
cameras on highways, which provide a surveillance perspective. However, for
these perspectives, data is scarce. The situation is worsened
by the high rate at which car manufacturers release new vehicle models.


To address the lack of data, domain adaptation methods make it possible to use
the abundant data of other domains, such as web-nature images, to
perform tasks like classification in domains with limited availability of
data, such as surveillance. While domain adaptation has been widely studied,
also specifically for fine-grained classification
applications, an unsupervised or semi-supervised domain adaptation
setting is commonly assumed. In these settings, a large number of images is
present in the target domain for all classes, but labels are missing for all or
some of the images. However, for real-world use cases, the assumption of
having data for all classes is hard to fulfill, since it could only be ensured if labels
were present. Thus, we focus on a different domain adaptation setting: a
supervised partially zero-shot setting [B8]. This setting assumes that for a large 
number of base classes, images and labels are available for both domains while 
for a small number of novel classes, images and labels are only available for the 
source domain. For these novel classes, no images are available at all for the 
target domain during training. However, the evaluation on these novel classes 
with images from the target domain is the main focus of the setting. 


Since the research for such a setting is rather small [82], we provide an extensive 
evaluation of existing domain adaptation methods to find a good baseline for 
further research. Besides the widely applied domain adversarial learning [8],
we explore the use of metric learning with a triplet loss, which has also shown
advantages for classification across domains [20][32].
Based on these experiments, we found that a typical softmax classifier only
achieves a low classification accuracy for the novel classes. However, a domain
adversarial loss strongly increases the accuracy. A distance-based classifier with
a combination of a cross entropy loss and a triplet loss showed promising results,
which can be further improved by a domain adversarial loss, resulting
in the overall best model.


In Section 2, existing works in the fields of fine-grained classification, cross-
domain classification and cross-domain fine-grained classification are introduced.
In Section 3, the evaluated methods are described, and the evaluation results
are shown in Section 4. A conclusion of this work is given in Section 5.


2 Related work 


In this chapter, an overview of the literature in the fields of fine-grained 
classification and cross-domain classification as well as works which employ 
cross-domain classification for fine-grained classification tasks is given. 


2.1 Fine-grained classification 


Various approaches have been used to improve the accuracy for fine-grained 
classification. While all recent approaches share their basis of deep neural 
networks, there are several different extensions and they can be structured 
into the following categories. Part-based models first detect relevant regions 
like specific parts of a vehicle before the crops of these parts are fed into a 
convolutional neural network (CNN) [41]. This reduces the feature
space to significant parts and thus, reduces the risk of overfitting. Bilinear 
CNNs employ two networks to separate the localization and the extraction 
of important features. The networks are combined by calculating the outer 
product of both resulting feature vectors [23]. Several extensions have been
proposed to improve the accuracy and efficiency of bilinear CNNs [9][19][43].
Multiple authors employ multi-task learning by learning an auxiliary task 
like predicting the viewpoint of the image that provides support for the main 
task of fine-grained classification. The auxiliary task is performed only during 
training to improve the learned features [3] or also during inference to provide
the network with additional information 2479]. Hierarchical classification 
exploits that fine-grained categories are usually defined on multiple layers, e.g. 
make, model and year of a car. This technique was explored by training multiple 
layers of the hierarchy in a round-robin manner and by training cascaded 
classifiers DI. Metric learning has also been applied to improve the features 
by minimizing intra-class variance and maximizing inter-class variance 
[42]. Temporal classification uses videos as input modality for fine-grained
object recognition instead of single images as done by most works. 
Webly-supervised classification gathers additional data from the web with 
image databases like Flickr providing images with additional meta information 
that can be used for defining labels [6][39].


2.2 Cross-domain classification 


Domain adaptation is usually employed if classification has to be done in a 
domain for which a lack of data exists. The lack of data can be in the form of 
missing images or missing annotations. Mostly, an unsupervised scenario is
considered, which contains abundant but unlabeled data for the target domain.
To approach a cross-domain setting, multiple methods have been proposed. 
We follow the taxonomy of Wang and Deng for the categorization of the 
approaches. Discrepancy-based domain adaptation methods are based on 
a criterion during fine-tuning to increase the accuracy for the target domain. 
Proposed criteria are class-based #5], statistic-based [48], architecture-
based [22] or geometry-based [4]. Adversarial-based domain adaptation
methods target a domain confusion of the trained network which disables the 
possibility of exploiting the domain of an image for the classification decision. 
This can be done by generative approaches which transform the appearance of a
source sample such that it cannot be distinguished from the distribution of target
samples [25]. Non-generative approaches have also been explored by using
domain adversarial training with a domain classifier that is preceded by a gradient 
reversal layer during training. This leads to features which are invariant in regard 
to distinguishing the domains. Reconstruction-based domain adaptation 
methods reconstruct samples from either domain to the other domain to create a 




A Baseline for Cross-Domain Fine-Grained Vehicle Classification 


domain-invariant representation. This has been explored by using a combination 
of an encoder and a decoder [11] as well as using a Cycle-GAN [47] that keeps
semantic information intact by using a cycle-consistency constraint [14]. 


2.3 Cross-domain fine-grained classification 


Some researchers have already addressed fine-grained classification in a cross- 
domain setting. Gebru et al. exploit the hierarchical nature of fine-grained 
classification by adding an attribute consistency loss that enforces a matching of 
coarse-grained attributes like vehicle types to the fine-grained category. With 
the coarse-grained attribute prediction being a significantly easier task, it is more 
domain invariant and thus, supports stabilizing the fine-grained prediction due to 
the new consistency loss. Tzeng et al. and Wang et al. also exploit the 
attribute and coarse-grained labels inherent to fine-grained classification tasks 
to improve the domain adaptation. Wang et al. extend adversarial domain-
level adaptation by a category-level domain alignment for semi-supervised
domain adaptation. Additionally, a part-wise classification to optimize the
fine-grained classification accuracy is introduced. Yu et al. achieve a class
confusion by training separate class labels for each domain in a pre-training
phase and swapping the class labels in a fine-tuning phase. The goal is to
achieve domain confusion while, in contrast to domain adversarial training,
keeping the class separability of the features intact during the adaptation process.


3 Methods 


In this section, the evaluated methods are described. They can be mainly 
divided by the type of classification. We evaluate a softmax classifier and a 
distance-based classifier. As feature extracting backbone, we use ResNet-50 
for both variants. On top of both variants, we evaluate the usage of domain 
adversarial training [8] to improve the domain invariance. While only common 
for distance-based classification, we evaluate a triplet loss for both variants 
due to the reported advantages in regards to cross-domain classification Pi]. 
3.1 Softmax classifier


The softmax classifier employs a fully-connected layer to predict one
logit per class and afterwards applies a softmax activation layer
to normalize the scores. On top of this output, a cross entropy loss is used to
calculate the training error.


Additionally, we evaluate the use of a domain adversarial head and an auxiliary 
triplet loss to improve the domain invariance of the features. Both additions are 
applied directly on the features of the backbone. 


3.2 Distance-based classifier 


For the distance-based classifier, during inference, we feed each preprocessed 
image into the backbone network and calculate the distance between the feature 
vector of the sample and a prototype feature vector for each class. We choose 
the class as final prediction for which the distance has the lowest value. The 
prototype is calculated as the mean of all training samples of a class from the 
source domain. We also evaluated the use of a medoid instead of a mean but the 
results indicated an advantage for the mean. Regarding the distance measure, 
we evaluated the Euclidean norm and the negative cosine similarity with the
results showing a clear advantage for the negative cosine similarity while the
Euclidean norm usually prevented the network from converging properly. Since
the cosine similarity is originally a similarity instead of a distance measure, we
use the negative of the cosine similarity as distance measure. The classification
can be described by the following formulas:

$$p_c = \frac{1}{|X_c|} \sum_{x' \in X_c} f(x')$$

$$\hat{c}(x) = \operatorname*{arg\,min}_{c \in C} \left( -\frac{f(x) \cdot p_c}{\lVert f(x) \rVert \, \lVert p_c \rVert} \right)$$

where $p_c$ is the feature prototype for the class $c$, $X_c$ is the set of training images
of a class $c$ from the source domain, $f$ is the backbone feature extractor, $\hat{c}(x)$ is
the predicted class for an image $x$ and $C$ is the set of known classes.
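The inference procedure described in this section (mean prototypes per class, negative cosine similarity as distance, prediction of the closest prototype) can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions: the features are toy 2-D vectors standing in for backbone outputs, and the function names `neg_cosine`, `class_prototypes` and `predict` are hypothetical and not taken from the evaluated implementation.

```python
import math
from typing import Dict, List, Sequence


def neg_cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Negative cosine similarity, used as a distance (lower means closer)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return -dot / (norm_u * norm_v)


def class_prototypes(features: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Mean feature vector per class, computed from source-domain training features."""
    prototypes = {}
    for label, vectors in features.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return prototypes


def predict(feature: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Return the class whose prototype has the smallest distance to the feature."""
    return min(prototypes, key=lambda label: neg_cosine(feature, prototypes[label]))


# Toy features standing in for backbone outputs of source-domain training images.
source_features = {
    "model_a": [[1.0, 0.1], [0.9, 0.2]],
    "model_b": [[0.1, 1.0], [0.2, 0.8]],
}
prototypes = class_prototypes(source_features)
print(predict([0.8, 0.3], prototypes))
```

Because the prototypes are computed only from source-domain features, a novel class can be added at inference time without any target-domain images, which is what makes this classifier a natural fit for the partially zero-shot setting.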
During training, we apply a cross entropy loss with a softmax activation on top
of a fully-connected layer. Since the cross entropy loss tends to learn features
which are highly dependent on the domain, we use a triplet loss as an additional loss
function that regularizes the network with regard to the domains. Additionally,
the triplet loss ensures that the chosen distance measure is appropriate for the
features during inference. After training, the fully-connected layer is dropped
and the extracted features are directly used as described above.


3.3 Domain adversarial training 


Ganin et al. [8] proposed a domain adversarial training method. It applies
a simple domain classifier on top of the features extracted by the backbone
and inserts a gradient reversal layer between the network and the domain
classifier. The gradient reversal layer leads to learning features which are as
unsuitable as possible for classifying the domain; thus, the features are expected
to be invariant with regard to the domain. Therefore, the classification loss, which
is applied in parallel, will focus on learning features which are inherent to the
class instead of exploiting the domain.
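The behavior of a gradient reversal layer can be illustrated without any deep learning framework: it is the identity in the forward pass and negates the gradient in the backward pass. The class below is a hand-written sketch for illustration only; the names `GradientReversal`, `forward` and `backward` as well as the scaling factor `lam` are assumptions, and in practice such a layer would be written as a custom autograd operation of the framework in use.

```python
from typing import List


class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam

    def forward(self, features: List[float]) -> List[float]:
        # The domain classifier receives the backbone features unchanged.
        return list(features)

    def backward(self, grad_from_domain_head: List[float]) -> List[float]:
        # The gradient of the domain loss is negated and scaled before it reaches
        # the backbone, so the backbone is updated to make the domains harder to tell apart.
        return [-self.lam * g for g in grad_from_domain_head]


grl = GradientReversal(lam=0.5)
print(grl.forward([0.2, -1.0]))
print(grl.backward([2.0, -4.0]))
```

The domain classifier itself is still trained with the unmodified gradient; only the signal flowing back into the backbone is reversed.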


For the domain classification head, we employ two hidden fully-connected layers
with 1024 channels, each followed by a ReLU activation and a batch
normalization layer. A final fully-connected layer with a single output channel,
which is followed by a sigmoid activation, predicts the domain. A binary cross
entropy is applied as training loss for the domain classification.


The gradient reversal layer includes a gating that controls the influence of the
reversed gradient of the domain classification loss onto the main network. We
call this parameter $\lambda$. A $\lambda$ of 1 means an unhindered influence while a $\lambda$ of
0 means that the domain classification has no influence on the main network
at all. A good choice of $\lambda$ might depend on the current state of training and
a pre-set value is probably not appropriate. Our results showed that the loss
coupling of $\lambda$ proposed by Wiedemer et al. was superior to a pre-set value
and an increasing schedule of $\lambda$ as it was originally proposed for the domain
adversarial training [8]. The loss coupling sets $\lambda$ for each iteration based on the
domain classification loss value of the previous iteration. The exact formula is
$\lambda_i = \exp(-L_{d,i-1})$ with $\lambda_i$ being the value of $\lambda$ set for the iteration $i$ and $L_{d,i-1}$ being
the domain classification loss for iteration $i-1$. This ensures that the domain
classification only has a strong influence on the main network if the loss is low,
meaning that the domain classifier is able to classify the domain adequately.
In case of a high domain loss, the domain classifier is not able to classify the
domains properly and will not provide a good domain adversarial loss.


3.4 Triplet loss 


A triplet loss explicitly minimizes the distance of features of the same class 
while maximizing the distance of features of different classes with respect to 
a chosen distance measure. While the cross entropy loss also tends to show 
a similar behavior, it only enforces a linear separability of classes which can 
result in features of a single class still being spread in feature space. This can be 
particularly dramatic for cross-domain scenarios for which the distribution of 
images is different between training and inference. Thus, we apply a triplet loss 
as additional loss that directly minimizes the distance of features of the same 
class. 
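A common hinge formulation of the triplet loss, combined with the negative cosine similarity as distance measure, is sketched below. The exact formulation used in the experiments is not spelled out in the text, so the hinge form, the function names and the toy vectors are assumptions made for illustration.

```python
import math
from typing import Sequence


def neg_cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Negative cosine similarity, used as a distance (lower means closer)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return -dot / (norm_u * norm_v)


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.3) -> float:
    """Hinge triplet loss: the positive must be closer than the negative by at least `margin`."""
    d_pos = neg_cosine(anchor, positive)  # same class as the anchor
    d_neg = neg_cosine(anchor, negative)  # different class than the anchor
    return max(0.0, d_pos - d_neg + margin)


# Well separated triplet: the loss vanishes.
print(triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
# Violated triplet: the negative is closer than the positive, so the loss is positive.
print(triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

Since the anchor and the positive may come from different domains, minimizing this loss pulls same-class features of both domains together, which is the regularizing effect described above.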


4 Experiments 


We execute quantitative evaluations to find a good baseline for cross-domain 
classification under a supervised partially zero-shot setting. First, the settings 
of the comparisons are described. Afterwards, the results are discussed. The 
comparisons include ablation studies for a softmax classifier, ablation studies 
for a distance-based classifier and a comparison between both approaches. 


4.1 Settings 


The datasets used for the experiments are described first. Afterwards, the 
evaluation metrics and training details are reported. 
4.1.1 Dataset


As dataset, we choose CompCars, which is one of the largest fine-grained
vehicle classification datasets available and consists of a web-nature part (CompCars
Web) and a surveillance-nature part (CompCars SV). CompCars
Web has a predefined split of 16,016 training images and 14,939 test images.
The predefined split of CompCars SV contains 31,148 training images and
13,333 test images.


While the CompCars Web is labeled according to the make, model and year of 
a specific car, the CompCars SV is only labeled up to the model of a car and 
lacks the year as annotation. Thus, we also only consider the model for all cars 
in CompCars Web. This results in a total of 431 classes for CompCars Web and 
a total of 281 classes for CompCars SV. We identify the intersection of both sets 
of classes and use only these for our experiments. Thus, we consider a total 
of 181 classes. Based on this set of classes, we create three different random 
splits of base and novel classes with the base classes containing 90% and the 
novel classes containing 10% of the classes. While during training, for the base 
classes abundant labeled images are available in both domains, we restrict the 
availability of data for the novel classes to the source domain of CompCars Web 
and no images from CompCars SV are available for the novel classes. For each 
experiment, a model is trained and evaluated on each split and the results are 
averaged. 
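The creation of the random base/novel splits can be sketched as follows. The class names are placeholders, and both the rounding of the 10% share (18 of 181 classes here) and the use of fixed seeds are assumptions, since the text does not specify them.

```python
import random
from typing import List, Tuple


def split_base_novel(classes: List[str], novel_fraction: float = 0.1,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split the shared classes into base and novel classes."""
    shuffled = sorted(classes)
    random.Random(seed).shuffle(shuffled)
    num_novel = round(len(shuffled) * novel_fraction)  # rounding is an assumption
    novel = sorted(shuffled[:num_novel])
    base = sorted(shuffled[num_novel:])
    return base, novel


# Placeholder names for the 181 classes shared by CompCars Web and CompCars SV.
shared_classes = [f"class_{i:03d}" for i in range(181)]

# Three different random splits, one per seed.
splits = [split_base_novel(shared_classes, seed=s) for s in range(3)]
for base, novel in splits:
    print(len(base), len(novel))
```

For the novel classes of each split, only CompCars Web images would be kept for training, while the CompCars SV images of these classes are used exclusively for evaluation.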


4.1.2 Evaluation metric 


We use the F1 score on the CompCars SV as main metric for our experiments.
We report the class-wise F1 score averaged over the base and the novel classes 
separately. Since our focus is on adding new classes to the classification, we 
focus mainly on the F1 score of the novel classes. Due to images of all classes 
being included in the test set, base classes still influence the score of the novel 
classes and vice versa. This is sensible since a network only focused on the 
prediction of novel classes should still be able to distinguish them from the total 
of all base classes even when distinguishing the base classes might be of minor 
importance. 
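The metric can be made concrete with a small sketch: the F1 score is computed per class over the complete test set and then averaged over the base and the novel classes separately. The helper names and the toy labels are hypothetical, and returning 0 for a class without any true or predicted samples is an assumed convention.

```python
from typing import Iterable, List


def class_f1(y_true: List[str], y_pred: List[str], label: str) -> float:
    """F1 score of a single class, computed over all test samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_f1(y_true: List[str], y_pred: List[str], labels: Iterable[str]) -> float:
    """Class-wise F1 averaged over the given subset of classes."""
    labels = list(labels)
    return sum(class_f1(y_true, y_pred, label) for label in labels) / len(labels)


base_classes = ["a", "b"]
novel_classes = ["n"]
y_true = ["a", "a", "b", "n", "n"]
y_pred = ["a", "n", "b", "n", "a"]

print("base F1:", mean_f1(y_true, y_pred, base_classes))
print("novel F1:", mean_f1(y_true, y_pred, novel_classes))
```

In the toy example, a base sample predicted as the novel class lowers the novel F1 through a false positive, which illustrates why base classes still influence the score of the novel classes.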
4.1.3 Training details


We choose SGD as optimizer with an initial learning rate of 0.04, and the learning
rate is reduced by a factor of 10 after 2500 iterations. We apply a momentum
of 0.9 and a weight decay of $10^{-4}$. The training runs for 12000 iterations
in total. A batch size of 512 per GPU with two GPUs is used. Each batch
contains 256 Web and 256 SV images. We evaluate after every 1000 iterations
and apply early stopping by choosing the checkpoint with the highest F1 score
for novel classes on the CompCars SV images. The weights are initialized
from a model pre-trained on ImageNet. During training, for each image, a
crop spanning an area between 8% and 100% of the original image is taken
randomly and resized to 224×224 pixels afterwards. Additionally, a
random horizontal flip is applied with 50% probability. Afterwards, the image is
normalized using the mean and standard deviation values of the pre-training
on ImageNet. For experiments with a triplet loss, we employ hard negative
mining and a margin of 0.3 since preliminary experiments have shown
good results for this value.


4.2 Inference details 


During evaluation, the images are resized such that the shorter side has 256
pixels while keeping the aspect ratio. Afterwards, a crop of size 224×224 pixels
is taken from the center of the resized image. The normalization is applied
in the same way as in the training configuration.
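The geometry of this preprocessing can be checked with a short sketch that only computes the sizes involved. The function names are hypothetical, and rounding the longer side to the nearest integer is an assumption; image libraries may round slightly differently.

```python
from typing import Tuple


def resized_size(width: int, height: int, shorter_side: int = 256) -> Tuple[int, int]:
    """Size after scaling the shorter side to `shorter_side` while keeping the aspect ratio."""
    scale = shorter_side / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width: int, height: int, crop: int = 224) -> Tuple[int, int, int, int]:
    """(left, top, right, bottom) of a centered crop of size crop x crop."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# Example: a Full HD surveillance frame.
new_w, new_h = resized_size(1920, 1080)
print(new_w, new_h)
print(center_crop_box(new_w, new_h))
```

For wide images, the center crop discards the left and right margins, so the vehicle should be roughly centered in the input for this evaluation protocol to work well.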


4.3 Softmax classification


We evaluate a softmax classifier as the most common architecture for deep-
learning-based classification. Since softmax classifiers tend to heavily exploit
domains in the classification, we explore the use of domain adversarial training
and an auxiliary triplet loss to improve the domain invariance of the network.
Adversarial training  $\lambda$ schedule  $\lambda$ base value  Base F1  Novel F1
No                    -                   -                     95.4     43.0
Yes                   Constant            0.1                   96.4     66.3
Yes                   Increasing          0.1                   96.4     66.3
Yes                   Increasing          1.0                   96.4     65.9
Yes                   Coupled             0.1                   96.5     67.7
Yes                   Coupled             1.0                   96.0     66.4


Table 4.1: Evaluation of different schedules for the $\lambda$ parameter of the domain adversarial training.
The results indicate a clear advantage for the coupled schedule when focusing on the important
novel classes.


4.3.1 Domain adversarial training 


Adversarial domain adaptation is a widely applied approach for domain 
adaptation. In order to find a strong baseline, we evaluate different schedules of 
the parameter that controls the influence of the domain adversarial head onto 
the main network. Besides a constant value and a widely applied monotonically 
increasing schedule [8], the coupled schedule by Wiedemer et al. is also
evaluated. The $\lambda$ base value denotes the constant value for the constant
schedule, the maximum value for the increasing schedule and the highest possible
value (in case of zero domain classification loss) for the coupled schedule. The
results are shown in Table 4.1.
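The three schedules can be compared side by side in a short sketch. The coupled schedule follows the exponential loss coupling described in Section 3.3, scaled by the base value so that the base value is reached for a domain loss of zero. The shape of the increasing schedule is an assumption: it uses the sigmoid-like ramp with steepness 10 known from the original domain adversarial training work, and the function names are hypothetical.

```python
import math


def lambda_constant(base: float) -> float:
    """Constant schedule: the base value is used in every iteration."""
    return base


def lambda_increasing(base: float, progress: float) -> float:
    """Monotonically increasing schedule; `progress` runs from 0 (start) to 1 (end).

    Assumed shape: base * (2 / (1 + exp(-10 * progress)) - 1), approaching `base`.
    """
    return base * (2.0 / (1.0 + math.exp(-10.0 * progress)) - 1.0)


def lambda_coupled(base: float, prev_domain_loss: float) -> float:
    """Coupled schedule: base * exp(-L_d) with the domain loss of the previous iteration."""
    return base * math.exp(-prev_domain_loss)


for progress, loss in [(0.0, 0.7), (0.5, 0.3), (1.0, 0.1)]:
    print(
        lambda_constant(0.1),
        lambda_increasing(0.1, progress),
        lambda_coupled(0.1, loss),
    )
```

The coupled schedule is the only one of the three that reacts to the state of the domain classifier: a poorly performing domain classifier (high loss) automatically receives less influence on the backbone.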


The adversarial training leads to an improvement of the base F1 score and
particularly of the novel F1 score with all evaluated schedules for $\lambda$. While the
impact of the schedule is negligible for the base F1 score, for the important
novel F1 score, the best results are achieved with the coupled schedule and a $\lambda$
base value of 0.1. Based on these results, we continue to use these settings for all
further experiments involving adversarial domain adaptation. The adversarial
training reduces the impact of the domain onto the features and thus leads to
features of novel classes in the target domain being closer to features of the same
class in the source domain. Therefore, the samples of novel classes in the target
domain are classified more accurately, which in turn leads to less confusion with
base classes. Thus, the base class accuracy is improved as well.


Triplet loss  Base F1  Novel F1
No            95.4     43.0
Yes           95.9     54.5


Table 4.2: Evaluation of a triplet loss as auxiliary loss for a softmax classifier. The results indicate
that an auxiliary triplet loss can improve the domain invariance of a softmax classifier.


4.3.2 Auxiliary triplet loss 


The triplet loss has been shown to be more domain invariant than a pure cross entropy
loss. Thus, we evaluate the impact of an auxiliary triplet loss in Table 4.2. The
triplet loss uses the negative cosine similarity as distance measure. A training
with the Euclidean norm as distance measure did not converge properly since the
Euclidean norm enforces a feature space that is not well suited for the cross
entropy loss. Thus, results for the Euclidean norm are not reported.


The results show a clear advantage of the triplet loss for the accuracy of the 
base as well as the novel classes. The increase is probably a result of the triplet 
loss forcing a distance of close to zero in feature space for all samples of a class 
and thus, reducing the possibility of a spread due to different domains. While 
this only applies for the base classes in training, it probably also reduces the 
distance of samples of the novel classes between both domains leading to the 
improvement in the novel class accuracy. This improvement then leads to an 
improvement in base class accuracy due to less confusion with novel classes 
occurring. 
Distance measure            Base F1  Novel F1
Euclidean norm              8.4      6.2
Negative cosine similarity  96.0     62.9


Table 4.3: Comparing distance measures for a distance-based classifier. The negative cosine
similarity shows a strong advantage with the Euclidean norm showing poor results due to the cross
entropy loss not converging properly.


4.4 Distance-based classification 


While CNNs are mostly combined with a logit-based classification head, distance- 
based classification and metric learning provide a higher flexibility due to not 
limiting the model to a specific set of classes during training. 


4.4.1 Distance measure 


For the distance-based classification, the choice of the distance measure is
crucial. Thus, we compare the Euclidean norm and the
negative cosine similarity. The respective distance measure is applied for the
triplet loss as well as for the classification. The results of the comparison are
shown in Table 4.3. They indicate a strong advantage of the negative cosine
similarity while the training with the Euclidean norm does not properly converge.
In particular, training the triplet loss with a Euclidean norm leads to a
non-decreasing cross entropy loss. The embedding induced by a triplet loss
with a Euclidean norm seems to be incompatible with a logit-based softmax
classification and a cross entropy loss. Seemingly, the optimizer cannot converge
to a proper embedding which suits both losses.


4.4.2 Prototype aggregation 


For the classification, we aggregate all training samples from the source domain
to estimate a prototype for each class and choose the class whose prototype is
the closest to the input sample in terms of feature distance. For the aggregation
of the samples, we evaluate a mean of the features and a medoid of the features.
The medoid is defined as the sample which has the smallest total distance to all
other samples. The results are shown in Table 4.4. While the difference on the
base classes is negligible, the mean aggregation shows a clear advantage over
the medoid for the novel classes.


Aggregation  Base F1  Novel F1
Mean         96.0     62.9
Medoid       96.0     61.8


Table 4.4: Comparison of the estimation methods for the class prototype. Using the mean of the
train samples shows a significant advantage over using the medoid.


Domain adversarial training  Base F1  Novel F1
No                           96.0     62.9
Yes                          96.3     69.8


Table 4.5: Evaluation of applying domain adversarial training with distance-based classification. The
results show that adversarial training can provide an advantage in combination with a distance-based
classifier.
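The difference between the mean and the medoid aggregation for a class prototype can be illustrated with the following sketch. It assumes that the medoid is determined with the same negative cosine similarity that is used for classification, which the text does not state explicitly; the function names and toy features are hypothetical.

```python
import math
from typing import List, Sequence


def neg_cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Negative cosine similarity, used as a distance (lower means closer)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return -dot / (norm_u * norm_v)


def mean_prototype(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of all feature vectors of a class."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def medoid_prototype(vectors: List[Sequence[float]]) -> Sequence[float]:
    """The sample with the smallest total distance to all other samples."""
    def total_distance(i: int) -> float:
        return sum(neg_cosine(vectors[i], vectors[j])
                   for j in range(len(vectors)) if j != i)

    best_index = min(range(len(vectors)), key=total_distance)
    return vectors[best_index]


class_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(mean_prototype(class_features))
print(medoid_prototype(class_features))
```

Unlike the mean, the medoid is always one of the actual training samples, so it cannot average out sample-specific variation; this is one possible reason for its weaker generalization to the novel classes.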


4.4.3 Domain adversarial training 


While the triplet loss already provides a strong improvement in terms of domain 
invariance for distance-based classification, we evaluate if domain adversarial 
training can still lead to an improved accuracy. Therefore, we apply domain
adversarial training with the best setting from the previous ablation studies
in addition to the cross entropy loss and the triplet loss we commonly use for
the distance-based classifier. The results are shown in Table 4.5 and indicate a
slight increase in terms of base class accuracy and a large increase in terms of
novel class accuracy.
Method                                               Base F1  Novel F1
Softmax classifier                                   95.4     43.0
Softmax classifier with adversarial training         96.5     67.7
Softmax classifier with triplet loss                 95.9     54.5
Distance-based classifier                            96.0     62.9
Distance-based classifier with adversarial training  96.3     69.8


Table 4.6: Comparison of softmax classifiers with and without domain regularization methods and
distance-based classification. The results show the advantage of distance-based classification for the
accuracy of the novel classes while the softmax classifier with domain adversarial training shows a
slight advantage for the base class accuracy.


4.5 Comparison of softmax classification and distance-based classification


We compare softmax-based classification methods with and without domain
adaptation extensions to a distance-based classification method in Table 4.6.
For the softmax-based classification, a domain adversarial training as well as 
an auxiliary triplet loss is evaluated to improve cross-domain classification 
accuracy. 


While the softmax classifier with adversarial training shows the highest 
accuracy for the base classes, the distance-based classifier combined with 
domain adversarial training follows closely behind and achieves the highest 
novel class accuracy of all evaluated classifiers. Without adversarial training, 
the softmax classifier shows a heavy drop in accuracy, particularly for the novel 
classes. The triplet loss also provides a large benefit for the softmax 
classifier. However, it still shows a large accuracy gap when compared to the 
adversarial loss. 
5 Conclusion 


In this work, different domain adaptation approaches were evaluated in a 
supervised partially zero-shot setting for fine-grained vehicle classification to 
employ web images as training data for classification on surveillance images. 
The results show the importance of domain adversarial training to achieve 
acceptable results with a softmax-based classifier. However, a distance-based 
classifier employing a combination of a cross entropy loss and a triplet loss still 
shows competitive results, which can be further improved by domain adversarial 
training. This combination showed the overall best results for the classification 
of the novel classes in our evaluation. 


The evaluation of better backbones, such as modern vision transformers or 
state-of-the-art convolutional network architectures, is left to future work. 
Other areas of future research are improvements directly targeting the supervised 
partially zero-shot setting which have not yet been evaluated for other settings. 
Abstract 


The ability to anticipate possible human actions in the distant future is of 
fundamental interest for a wide range of applications, including autonomous 
driving, surveillance, and human-robot interaction. Consequently, various 
methods have been presented for action anticipation in recent years, with 
deep learning-based approaches being particularly popular. In this work, we 
give a short overview of recent advances in long-term action anticipation 
algorithms. 


1 Introduction 


In recent years, we have seen tremendous progress in the capabilities of 
computer systems to classify and segment activities in videos. These systems, 
however, analyze the past or, in the case of real-time systems, the present with 
a delay of a few milliseconds. For applications where a moving system has 
to react to or interact with humans, this is insufficient. For instance, to be 
able to offer a hand at the right time or to generate proactive dialog to provide 
more natural interactions, collaborative robots that work closely with humans 
have to anticipate the future activities of a human. Compared to human 
action recognition and early action recognition, where the entire action segment 
or a part of it is observable, action anticipation aims to predict a future action 
without observing any part of it, as displayed in Figure 1.1. 


As the anticipation results are just assumptions, this task tends to be 
significantly more challenging than traditional action recognition, which performs 
well with today's well-honed discriminative models 7117. Consistent with action 
recognition, anticipation approaches started with predictions based on a single 
video frame and have tended to use longer temporal context in recent years. 
Apart from using a long action history, many approaches attempt to leverage 
modalities other than the raw video frames, such as motion information and the 
objects contained in the scene, to further improve the predictive ability. 


While many recent works anticipate activities only for a very short time horizon 
of a few seconds [9][8], there is a parallel line of work [6] which addresses 
the problem of anticipating all activities that will happen within a time 
horizon of up to several minutes. This is particularly interesting for robot 
systems that require a certain amount of time to react and plan future tasks. 


In spite of the enormous amount of research conducted in this area, the problem 
remains difficult due to challenges inherent to the task, such as the multi-modal 
distribution of future action candidates, especially when predicting far into 
the future (long-term anticipation). As action recognition is usually a 
fundamental sub-component of an anticipation system, the challenges of action 
recognition also apply, such as the tremendous intra-class variance among the 
activities, huge spatio-temporal scale variation, target motion variations, etc. 
Moreover, low image resolution, object occlusion, illumination change and 
viewpoint change further aggravate these challenges. 


Although classical learning approaches, such as Conditional Random Fields 
(CRFs) [15], Markov models [23], and other statistical methods [19][22], have 
been widely used in the literature, we put our focus on deep learning techniques 
and how they have been extended or applied to daily-living action anticipation, 
leaving the classical approaches outside the scope of the present review. In this 
context, the terms action anticipation, action prediction, and action forecasting 
are used interchangeably. 
Figure 1.1: The action anticipation task aims to anticipate future action(s) before they happen, 
whereas action recognition and early action recognition require the observation of complete and 
partial actions, respectively. 


This survey is structured as follows. In Section 2, we describe both short-term 
and long-term anticipation tasks which are commonly used in the literature, so 
that the reader can better distinguish between them. In Section 3, we introduce 
the current approaches that address the long-term anticipation task and discuss 
their limitations. Finally, we conclude this survey in Section 4. 


2 Problem Statement 


Based on the prediction time horizon, action anticipation approaches can be 
grouped into two categories: short-term anticipation approaches and long-term 
anticipation approaches. While short-term approaches predict a single action a 
few seconds into the future, long-term approaches predict a sequence of future 
actions with their durations up to several minutes into the future. In the following 
sections, we present the detailed task definitions of both categories as commonly 
used in the literature. 


Figure 2.1: Categories of the action anticipation task. While short-term anticipation aims at 
predicting a single future action, the long-term task aims to predict a sequence of the following actions. 


2.1 Short-term anticipation 


Most short-term anticipation approaches follow the setup defined in 
[5]. As illustrated in Figure 2.1(a), the task aims to predict a future action by 
observing a video segment of length $\tau_o$. The observation segment precedes 
the action by $\tau_a$ seconds, i.e., it spans from time $\tau_s - (\tau_a + \tau_o)$ 
to $\tau_s - \tau_a$, where $\tau_s$ is the starting time of the action and $\tau_a$ 
denotes the “anticipation time”, i.e., how many seconds in advance actions are 
to be anticipated. The anticipation time $\tau_a$ is usually fixed for each dataset, 
whereas the length of the observation segment is typically dependent on the 
individual method. Methods in this category typically use synchronous data to 
perform the anticipation task, meaning that the input to the model is a sequence 
of frames that have the same temporal spacing before the action [9][8]. 
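

As a small worked example of the timing described above, the following sketch 
computes the boundaries of the observed segment from the action start time, the 
anticipation time and the observation length. The function name and the example 
values are hypothetical and only serve as an illustration. 

```python
def observation_window(tau_s, tau_a, tau_o):
    """Return (start, end) of the observed segment in seconds.

    tau_s: starting time of the action to be anticipated
    tau_a: anticipation time (gap between observation end and action start)
    tau_o: length of the observed segment
    """
    if tau_a < 0 or tau_o <= 0:
        raise ValueError("tau_a must be >= 0 and tau_o must be > 0")
    start = tau_s - (tau_a + tau_o)
    end = tau_s - tau_a
    return start, end


# Action starts at 30 s, anticipated 1 s in advance, 3.5 s of video observed.
print(observation_window(30.0, 1.0, 3.5))  # (25.5, 29.0)
```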
Some work attempts to predict the starting time of the next action 
as well. As this task involves the duration of each action, these approaches 
usually use asynchronous data as input to the model, containing a sequence of 
action categories and inter-arrival times. The inter-arrival time is defined as 
the difference between the starting times of the last and the current action. 
With the predicted inter-arrival time, the starting time of the next action can 
be easily deduced. 
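

The asynchronous representation can be sketched as follows. The example converts 
a list of (label, starting time) pairs into (label, inter-arrival time) pairs and 
recovers the next starting time from a predicted inter-arrival time. Assigning an 
inter-arrival time of zero to the first action is an assumption made here for 
simplicity, and all names and values are illustrative. 

```python
def to_asynchronous(segments):
    """Convert (label, start_time) pairs into (label, inter_arrival) pairs.

    Assumption: segments are sorted by start time and the first action
    gets an inter-arrival time of 0.0, since it has no predecessor.
    """
    events = []
    previous_start = None
    for label, start in segments:
        inter_arrival = 0.0 if previous_start is None else start - previous_start
        events.append((label, inter_arrival))
        previous_start = start
    return events


def next_start_time(last_start, predicted_inter_arrival):
    """Starting time of the next action from a predicted inter-arrival time."""
    return last_start + predicted_inter_arrival


segments = [("take cup", 2.0), ("pour coffee", 5.5), ("stir", 9.0)]
print(to_asynchronous(segments))
# [('take cup', 0.0), ('pour coffee', 3.5), ('stir', 3.5)]
print(next_start_time(9.0, 4.0))  # 13.0
```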


2.2 Long-term anticipation 


There is a parallel line of research addressing the long-term anticipation task, 
which was proposed in [6]. The goal is to anticipate the category and the duration 
of future actions for a given time horizon, which can take up to several minutes, 
as illustrated in Figure 2.1(b). Long-term approaches typically take a sequence 
of observed action categories and their durations to predict another sequence of 
actions and durations [6]{T][31]. 


3 Long-term Anticipation Approaches 


3.1 Methods 


Farha et al. [6] first introduced the long-term action anticipation task and 
proposed two models to tackle it. The first is based on an RNN model, which 
outputs the remaining length of the current action as well as the class and 
length of the next action, as shown in Figure 3.1. The long-term prediction is 
conducted recursively, i.e., observations are combined with the current prediction 
to produce the next prediction. The second is based on a CNN model, which outputs 
a sequence of future actions in the form of a matrix in a single step. Both 
methods have limitations: the RNN model is time-consuming and suffers from error 
accumulation, while the CNN model introduces many parameters when predicting long 
sequences. Ke et al. proposed a method to explicitly address these issues. They 
chose to condition on a time variable representing the prediction horizon. 
Specifically, they transformed the prediction time horizon to a time 
representation and concatenated it with the original inputs, forming 
time-conditioned observations. Their model is therefore capable of anticipating 
a future action at arbitrary and variable time horizons in a one-shot fashion. 
Additionally, they introduced a time-conditioned skip connection between the 
last observed action and the initial anticipation based on the intuition that the 
last action of the observation is generally relevant to the future actions. 


Figure 3.1: Architecture of the RNN system [6]. The input is a sequence of (length, 1-hot class 
encoding)-tuples. The network predicts the remaining length of the last observed action and the 
label and length of the next action. Appending the predicted result to the original input, the next 
action segment can be predicted. Figure is taken from [6]. 
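

The recursive prediction scheme of the RNN model described above can be sketched 
as a simple rollout loop. This is only an illustration under stated assumptions: 
`predict_step` stands in for the trained network, `toy_predict_step` is a 
hypothetical placeholder with a fixed action order, and the remaining length is 
only used in the first step because later segments are already predicted in full. 

```python
def rollout(observed, predict_step, horizon, max_steps=100):
    """Recursively predict future (label, length) segments.

    observed: list of (label, length); the last entry is the ongoing action.
    predict_step(history) -> (remaining_length, next_label, next_length)
    horizon: amount of future time (in seconds) to cover.
    """
    history = list(observed)
    predicted = []
    elapsed = 0.0
    for step in range(max_steps):  # guard against non-positive lengths
        if elapsed >= horizon:
            break
        remaining, next_label, next_length = predict_step(history)
        if step == 0:
            # Complete the ongoing action that was only partially observed.
            label, length = history[-1]
            history[-1] = (label, length + remaining)
            predicted.append((label, remaining))
            elapsed += remaining
        # Append the prediction to the input for the next step.
        history.append((next_label, next_length))
        predicted.append((next_label, next_length))
        elapsed += next_length
    return predicted


def toy_predict_step(history):
    """Hypothetical stand-in for the network: fixed order, fixed lengths."""
    order = ["take_cup", "pour_coffee", "add_milk", "stir"]
    label, _ = history[-1]
    next_label = order[(order.index(label) + 1) % len(order)]
    return 1.0, next_label, 4.0


future = rollout([("take_cup", 3.0)], toy_predict_step, horizon=10.0)
print(future)
```

Because each step consumes the previous prediction, an early mistake propagates 
to all later segments, which is the error accumulation mentioned above. 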


Inspired by [12], Gong et al. proposed an encoder-decoder structure based on 
the transformer architecture PIB], which effectively captures long-term relations 
over the whole sequence of actions. The encoder learns to capture fine-grained 
long-range temporal relations between the observed frames from the past, while 
the decoder learns a sequence of future action queries, capturing global relations 
between upcoming actions in the future along with the observed features from 
the encoder. Because of the proposed parallel decoding, the model is able to 
perform more accurate and faster inference without the potential error 
accumulation caused by autoregressive decoding. However, the number of predictable 
future actions is limited to the number of action queries used in the training 
process, which might need to be adapted if the model is applied to other datasets. 


Predicting the future is inherently multi-modal. Given an observed video segment 
containing an ongoing action, multiple actions could plausibly follow the 
observed one. This uncertainty becomes even larger when predicting far into the 
future. Therefore, it may be beneficial to model the underlying uncertainty, 
which allows capturing different possible future actions. However, in most 
approaches, action prediction is treated as a classification problem and 
optimized with a cross-entropy loss, which leads to an overly high resemblance to 
the dominant ground truth while suppressing other reasonable possibilities Bl. 
Moreover, approaches that are optimized with the mean squared error tend to 
produce the mean of the modes 0]. To this end, several approaches have been 
proposed to tackle the uncertainty in future predictions, which are described 
below. 


Farha and Gall [1] introduced a framework that predicts all subsequent actions 
and corresponding durations in a stochastic manner. In their framework, an 
action model similar to the one proposed in [6] (shown in Figure 3.1) and a 
time model are trained to predict the probability distribution of the future action 
label and duration, respectively. While action labels are treated as categories 
and optimized with the cross-entropy (CE) loss, durations are treated as 
real-valued variables which are modeled with a Gaussian distribution and optimized 
with the negative log likelihood (NLL). At test time, a future action label and 
its duration are sampled from the learned distributions. Long-term predictions are 
achieved by feeding the predicted action segment to the model recursively. 
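

The duration loss and the sampling step at test time can be sketched as follows. 
This is a minimal illustration, not the implementation of [1]: the predicted 
class probabilities and Gaussian parameters are assumed to be given, the function 
names are hypothetical, and clipping sampled durations at zero is an additional 
assumption. 

```python
import math
import random


def gaussian_nll(duration, mean, std):
    """Negative log likelihood of a duration under N(mean, std^2)."""
    return 0.5 * math.log(2.0 * math.pi * std ** 2) + (duration - mean) ** 2 / (2.0 * std ** 2)


def sample_segment(action_probs, duration_mean, duration_std, rng):
    """Sample one future (label, duration) pair from the predicted distributions.

    action_probs: dict mapping action label -> predicted probability
    """
    labels = list(action_probs)
    weights = [action_probs[label] for label in labels]
    label = rng.choices(labels, weights=weights, k=1)[0]
    # Assumption: negative samples are clipped, since durations cannot be negative.
    duration = max(rng.gauss(duration_mean, duration_std), 0.0)
    return label, duration


rng = random.Random(0)
probs = {"pour_coffee": 0.6, "add_milk": 0.3, "stir": 0.1}
print(sample_segment(probs, duration_mean=5.0, duration_std=1.5, rng=rng))
print(gaussian_nll(4.0, mean=5.0, std=1.5))
```

Feeding each sampled segment back into the model, as in the recursive scheme 
above, yields one possible future sequence per run. 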


Zhao and Wildes proposed Conditional Adversarial Generative Networks 
to address the underlying uncertainty when predicting future action sequences. 
More specifically, in contrast to many works that operate with a continuous time 
variable (| [1][12][10], they treated both action labels and time as discrete data 
which are formatted as one-hot vectors. These vectors are first projected to 
higher-dimensional continuous spaces and concatenated, and then fed to a seq2seq 
generator to compute the logits of future action labels and their corresponding 
times. To obtain differentiable sampling that generates future sequences with 
both quality and diversity during training, the Gumbel-Softmax relaxation 
technique, which mimics one-hot vectors from categorical distributions, and a 
normalized distance regularizer, which encourages diversity, are adopted. A 
ConvNet classifier is used as the discriminator so that the generator can be 
trained adversarially. 
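

The Gumbel-Softmax relaxation mentioned above can be sketched in a few lines. 
The example only shows the forward sampling step on plain Python lists; the 
logits, the temperature value and the function name are illustrative 
assumptions, and the normalized distance regularizer is not covered. 

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample from the categorical given by the logits.

    Lower temperatures push the output closer to a one-hot vector.
    """
    scores = []
    for logit in logits:
        u = max(rng.random(), 1e-12)         # avoid log(0)
        gumbel = -math.log(-math.log(u))     # Gumbel(0, 1) noise
        scores.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
sample = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
print(sample)
```

In an automatic differentiation framework, the same computation is 
differentiable with respect to the logits, which is what allows gradients from 
the discriminator to reach the generator. 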


Mehrasa et al. proposed using a recurrent variational auto-encoder (VAE) [13] 
to capture the distribution over the times and categories of action sequences. 
To overcome the problem that a fixed prior distribution of the latent variable 
(usually $\mathcal{N}(0, I)$ in VAE models) may ignore temporal dependencies present 
between actions, the authors learned a prior that varies across time. At test 
time, a latent code is sampled from the learned prior distribution, based on which 
the probability distributions of the action class and the corresponding time are 
inferred. 


3.2 Limitations 


Despite the impressive performance on the standard benchmarks [25][16], current 
approaches have several limitations, which are described below. 


Limited representativeness of the evaluation datasets. The commonly used 
benchmark datasets for long-term anticipation, i.e., Breakfast and 
50Salads [25], contain only videos of a specific kitchen activity, which usually 
last several minutes. Since there is only one activity per video, i.e., either 
preparing a breakfast or preparing a salad, it is easier to predict the following 
actions than in real-world scenarios, where a completely different action might 
occur next. Furthermore, since these videos are typically only several minutes 
long, the current setting may not be directly applicable to longer videos, 
especially in real-world applications. Moreover, these datasets do not contain any 
concurrent actions. However, actions in real-world scenarios, such as making a 
phone call and taking notes, may be performed simultaneously. 


Difficult deployment of methods that incorporate uncertainty. Methods that 
incorporate uncertainty typically learn a joint distribution of all data samples. 
For evaluation, authors usually draw many samples from the learned distribution 
and compute the average metric value of all drawn samples 1B1], or select 
the most frequent sample as the final result [21]. However, such an evaluation 
protocol requires multiple runs of the model, which is time-consuming and 
therefore difficult to deploy in real-time systems. 
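

The two evaluation protocols described above can be sketched as follows. The 
example assumes that each drawn sample is a sequence of per-frame action labels 
and uses a simple frame-wise accuracy as the metric; both are assumptions made 
for illustration, as the text does not fix the metric, and the function names 
are hypothetical. 

```python
from collections import Counter


def frame_accuracy(prediction, ground_truth):
    """Fraction of frames whose predicted label matches the ground truth."""
    matches = sum(p == g for p, g in zip(prediction, ground_truth))
    return matches / len(ground_truth)


def mean_metric_over_samples(samples, ground_truth, metric=frame_accuracy):
    """Protocol 1: average the metric over all drawn samples."""
    return sum(metric(s, ground_truth) for s in samples) / len(samples)


def most_frequent_sample(samples):
    """Protocol 2: use the most frequently drawn sequence as the final result."""
    counts = Counter(tuple(s) for s in samples)
    return list(counts.most_common(1)[0][0])


# Each entry corresponds to one stochastic forward pass of the model.
samples = [["a", "a", "b"], ["a", "b", "b"], ["a", "a", "b"]]
ground_truth = ["a", "a", "b"]
print(mean_metric_over_samples(samples, ground_truth))
print(most_frequent_sample(samples))  # ['a', 'a', 'b']
```

Both protocols need one model run per drawn sample, so the cost grows linearly 
with the number of samples. 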


4 Conclusion 


In this survey, we gave a short overview of the current approaches that have been 
proposed to tackle the long-term action anticipation task. We analyzed the 
methods from two perspectives: the research question that each method addresses 
and the description of the method itself. Finally, we described the limitations 
of the current approaches. In conclusion, long-term action anticipation is an 
interesting and relatively new research topic, which attracts increasing attention 
in the community and benefits many intelligent decision-making systems. While 
great strides have been made, there is still considerable room for improvement in 
action anticipation using deep learning techniques. 